On the elusiveness of clusters 
Abstract. Rooted phylogenetic networks are often used to represent conflicting phylogenetic
signals. Given a set of clusters, a network is said to represent these clusters in the
softwired sense if, for each cluster in the input set, at least one tree embedded in the
network contains that cluster. Motivated by parsimony we might wish to construct such a
network using as few reticulations as possible, or minimizing the level of the network, i.e. 
the maximum number of reticulations used in any "tangled" region of the network. Although 
these are NP-hard problems, here we prove that, for every fixed k > 0, it is polynomial-time 
solvable to construct a phylogenetic network with level equal to k representing a cluster 
set, or to determine that no such network exists. However, this algorithm does not lend 
itself to a practical implementation. We also prove that the comparatively efficient CASS 
algorithm correctly solves this problem (and also minimizes the reticulation number) when 
input clusters are obtained from two not necessarily binary gene trees on the same set of 
taxa but does not always minimize level for general cluster sets. Finally, we describe a new 
algorithm which generates in polynomial-time all binary phylogenetic networks with exactly 
r reticulations representing a set of input clusters (for every fixed r > 0). 



1 Introduction 

The traditional abstraction for modeling evolution is the phylogenetic tree. The underlying
principle of such a tree is that the observed diversity in a set of species (or, more abstractly, a set of taxa)
can be explained by branching events that cause lineages to split into two or more sublineages
[211617]. However, there is increasing attention for the situation when observed data cannot
satisfactorily be modeled by a tree. The field of phylogenetic networks has arisen with this challenge in
mind. Phylogenetic networks generalize phylogenetic trees, but within this very general
characterization there are many different definitions and models [12]. In this article we are concerned with
rooted phylogenetic networks. Such networks assume that the observed data evolves from a unique 
starting point (the root) and that evolution is directed away from this root. The main way these 
networks differ from rooted phylogenetic trees is the presence of reticulation nodes: nodes with 
indegree 2 or higher. For the remainder of this article we will use the term phylogenetic network, 
or just network, to refer to rooted phylogenetic networks. We refer the reader to [12, 19, 20, 11, 25]
for detailed background information. 

Constructing a phylogenetic network that "explains" the observed data is a trivial problem if 
no optimality criteria are imposed upon the constructed network. One simple optimality criterion 
that has attracted a great deal of attention in the literature, reticulation minimization, is to
compute a phylogenetic network that explains the observed data but using as few reticulation events
(essentially, reticulation nodes) as possible. This is an algorithmically hard problem, irrespective
of the exact construction technique that is being applied [25]. A related optimality criterion, level
minimization [16, 15, 24, 26, 25], is motivated by the observation that a phylogenetic network can be
regarded as some kind of tree backbone decorated with tangles of reticulate activity [12]. Here the
challenge is to construct a phylogenetic network that explains the observed data but such that the 
maximum number of reticulation events inside any biconnected component, the level, is as low as 
possible. Reticulation minimization is thus a global optimality criterion, and level minimization is 
in some sense a local optimality criterion; see Figure 1. Both criteria will have an important role
in this article. 

Fig. 1. Example of a phylogenetic network with five reticulations. The encircled subgraphs form
its biconnected components, also known as its "tangles". This binary network has level equal to 2 
since each biconnected component contains at most two reticulations. 

The question remains, when does a phylogenetic network explain the observed data? This 
depends very much on the exact construction technique being applied. The classical problem is 
motivated by the biological observation that, although the evolution of a set of organisms might 
best be explained by a phylogenetic network, the individual genes of the organisms will generally 
undergo treelike evolution [19]. In such a case one can think of the gene trees as being displayed
by (i.e. topologically embedded within) the species network. The reticulation nodes then have an 
explicit biological interpretation as (for example) hybridization, recombination or horizontal gene 
transfer events. Hence the following problem: given a set of rooted phylogenetic trees, all on the 
same set of taxa, compute a phylogenetic network with a minimum number of reticulations that 
displays all the input trees. The problem is already NP-hard (and APX-hard) for the case of two
input trees [3]. However, extensive research by different authors has shown that, by exploiting the
fixed-parameter tractability of the problem [1, 2], the two-tree problem can be solved to satisfaction
for many instances [4, 31]. The case of more than two input trees, or input trees that are not all
binary (i.e. some nodes have outdegree three or higher), has been considerably less well studied
[25, 31].

A parallel, and related, line of research concerns "piecewise" assembly of phylogenetic networks. 
Whereas the tree problem described above concerns the combination of a small number of large 
hypotheses (e.g. gene trees) into a phylogenetic network, an alternative strategy is to combine a 
large number of small hypotheses into a phylogenetic network. Examples of such small hypotheses 
include rooted triplets (phylogenetic trees defined on size-3 subsets of the taxa) [28, 26], (binary)
characters (e.g. whether or not the taxon is a vertebrate) [8, 9, 30] and clusters (clades) [10, 11, 27].
Proponents of such piecewise assembly techniques argue that in this way it is easier (than with 
trees) to discard parts of the input that are not well-supported. In this article we focus specifically 
on clusters, although the classical tree problem and all the other piecewise construction techniques 
do play a secondary role. This secondary role is linked to the fact, as observed in [25] , that under 
certain circumstances all these different models behave in a unified way. We shall return to this 
point later. 

Fig. 2. (a) The output of the galled network algorithm [11] for C = {{a, b, f, g, i}, {a, b, c, f, g, i},
{a, b, f, i}, {b, c, f, i}, {c, d, e, h}, {d, e, h}, {b, c, f, h, i}, {b, c, d, f, h, i}, {b, c, i}, {a, g}, {b, i}, {c, i},
{d, h}} and (b) a (simple) network with two fewer reticulations that also represents this set of
clusters.

Let us then say more about the cluster model. A cluster C is a subset of the taxa and we say
that a phylogenetic network represents the cluster in the softwired sense if some tree embedded
in the network contains a clade equal to that cluster. In other words: some tree T embedded
in the network has an edge such that C is exactly the set of all taxa reachable from the head 
of that edge by directed paths. The general problem is, given a set of clusters, to construct an 
optimal phylogenetic network that represents all the input clusters. The set of input clusters 
can be constructed in an ad-hoc fashion, but often the set of clusters is generated by extracting 
the set of clusters induced by a set of rooted phylogenetic trees and then possibly excluding 
weakly supported clusters. This is the technique applied in the program Dendroscope [13]. A
disadvantage of this technique is that some of the topology of the original trees can be lost [25],
but on the plus side it permits a focus on only well-supported clades, which is a major concern of 
practicing phylogeneticists. 

In [11] it was shown, given a set of clusters, how to construct a galled network with a small
number of reticulations that represents the clusters. However, given that galled networks are a
restricted subclass of phylogenetic networks, it was unclear how far that algorithm actually
minimizes the number of reticulations (or the level) when ranging over the entire space of phylogenetic
networks; see for example Figure 2. This is the context in which the Cass algorithm was developed
[27]. The Cass algorithm, in some sense a natural follow-up to the algorithm of [11], was formally
shown to produce solutions of minimum level whenever the minimum level is at most two.
However, the optimality of the Cass algorithm for "higher-level" inputs, and the performance of the
algorithm in terms of minimizing the number of reticulations, remained unclear. On the practical side
the good news was that Cass produced solutions with fewer reticulations and lower level than
the algorithm from [11]. Intriguingly it was also observed in [27] that, for several sets of input
clusters induced by two binary trees, the networks produced by Cass had an identical number of 
reticulations to networks generated by algorithms that aim to display the trees themselves. This 
observation was the inspiration behind [25] in which it was proven that, in the case of two binary 
trees, the choice of construction technique (tree, triplets, characters, clusters) does not affect the 
number of reticulations required. However, this unification was shown to break down for data 
obtained from three or more binary trees. 

After [25] several important questions about clusters remained open. Does Cass always
minimize level? If not, can we find a different algorithm that efficiently minimizes level? Under which
circumstances does Cass also minimize the number of reticulations? To what extent do the unification
results of [25] hold for non-binary trees?

The results in this article settle many of these open questions. Firstly, we show that in the case 
of clusters obtained from two not necessarily binary trees, a divide and conquer algorithm using
Cass as a subroutine, called here Cass DC and implemented in Dendroscope [13], does minimize
both the number of reticulations and the level. Spin-off results from this include non-binary
versions of several unification results from [25], culminating in the observation that, in the case
of clusters obtained from two trees, Cass DC also computes the minimum number of reticulations
required to display the trees themselves (in the sense of [12]), rather than just the clusters from the
trees. We also obtain deeper insights into why the two-tree case is so special and not representative 
for the problem on three or more trees. In particular, the two-tree case seems to be best understood 
as the only point at which a very natural lower bound is guaranteed to be tight. 

Secondly we show that Cass does not, unfortunately, always minimize level when the input 
data requires solutions of level 3 or higher. We give an explicit counterexample and explain what 
goes wrong with the Cass algorithm in this case. 

To offset this negative result we describe a polynomial-time algorithm that shows, for every 
fixed natural number k, how to determine whether a set of clusters can be represented by a
network with level k. This algorithm, which is very different to Cass, is purely theoretical but 
does give important insights into the underlying structure of the cluster model. Also on the positive 
side we show that, for sets of clusters induced by arbitrarily large sets of binary trees, a simple 
polynomial-time algorithm can construct all binary phylogenetic networks with r reticulations that 
represent the clusters, for every fixed r > 0. To demonstrate an important design principle first 
observed in [25] we give a practical implementation of this algorithm, Clustistic, which elegantly 
"bootstraps" an existing software package for merging rooted triplets into a phylogenetic network. 

To summarize, the results in this article help advance our understanding of the cluster model 
considerably. Nevertheless, many questions remain, and in the final section of this article we discuss 
a number of them. Perhaps the biggest question, which is the motivation for the title of this article, 
concerns the fact that the cluster model so far has not enjoyed the same kind of steady algorithmic 
improvements witnessed in the tree literature. Why does it seem harder to work with the clusters 
inside the trees than the trees themselves? To what, exactly, can the elusiveness of clusters be 
attributed? 

2 Preliminaries 

Consider a set X of taxa. A rooted phylogenetic network (on X), henceforth network, is a directed
acyclic graph with a single node with indegree zero (the root), no nodes with both indegree and
outdegree equal to 1, and leaves bijectively labeled by X. In this article we identify the leaves
with X. The indegree of a node v is denoted δ⁻(v) and v is called a reticulation if δ⁻(v) ≥ 2. An
edge (u, v) is called a reticulation edge if its target node v is a reticulation and is called a tree edge
otherwise. When counting reticulations in a network, we count reticulations with more than two
incoming edges more than once because, biologically, these reticulations represent several reticulate
evolutionary events. Therefore, we formally define the reticulation number of a network N = (V, E)
as

\[ r(N) = \sum_{v \in V :\, \delta^-(v) > 0} \bigl(\delta^-(v) - 1\bigr) = |E| - |V| + 1 . \]
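
The two expressions for the reticulation number can be checked against each other on a small example. The following is a minimal sketch, assuming a network is given simply as a list of directed edges; the function name and the example network are hypothetical and not taken from the article.

```python
from collections import defaultdict

def reticulation_number(edges):
    """Return r(N) for a network given as a list of directed edges (u, v)."""
    indegree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        indegree[v] += 1
        nodes.update((u, v))
    # Sum of (indegree - 1) over all nodes with positive indegree.
    by_sum = sum(d - 1 for d in indegree.values() if d > 0)
    # Closed form |E| - |V| + 1, valid because there is exactly one root.
    by_count = len(edges) - len(nodes) + 1
    assert by_sum == by_count
    return by_sum

# Hypothetical network on X = {a, b, c} with a single reticulation h.
edges = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "h"),
         ("v", "h"), ("v", "c"), ("h", "b")]
print(reticulation_number(edges))  # prints 1
```

Here |E| = 7 and |V| = 7, so both expressions give 1, matching the single node of indegree 2.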

A rooted phylogenetic tree on X , henceforth tree, is simply a network that has reticulation 
number zero. We say that a network N on X displays a tree T if T can be obtained from N by
performing a series of node and edge deletions and eventually by suppressing nodes with both 
indegree and outdegree equal to 1. We assume without loss of generality that each reticulation has 
outdegree at least one. Consequently, each leaf has indegree one. We say that a network is binary 
if every reticulation node has indegree 2 and outdegree 1 and every tree node that is not a leaf 
has outdegree 2. 

Proper subsets of X are called clusters, and a cluster C is a singleton if |C| = 1. We say that 
an edge (u, v) of a tree represents a cluster C ⊂ X if C is the set of leaf descendants of v. A tree
T represents a cluster C if it contains an edge that represents C. It is well-known that the set
of clusters represented by a tree is a laminar set, and uniquely defines that tree. We say that a
network N represents a cluster C ⊂ X "in the hardwired sense" if there exists a tree edge (u, v)
of N such that C is the set of leaf descendants of v. Alternatively, we say that N represents C "in 
the softwired sense" if N displays some tree T on X such that T represents C. In this article we 
only consider the softwired notion of cluster representation and henceforth assume this implicitly. 
A network represents a set of clusters C if it represents every cluster in C (and possibly more). The
set of softwired clusters of a network can be obtained as follows. For a network N, we say that
a switching of N is obtained by, for each reticulation node, deleting all but one of its incoming
edges. Given a network N and a switching T_N of N, we say that an edge (u, v) of N represents a
cluster C w.r.t. T_N if (u, v) is an edge of T_N and C is the set of leaf descendants of v in T_N. The
set of softwired clusters of N is the set of clusters represented by all edges of N w.r.t. T_N, where
T_N ranges over all possible switchings [12]. It is also natural to define that an edge (u, v) of N
represents a cluster C if there exists some switching T_N of N such that (u, v) represents C w.r.t.
T_N. Note that, in general, an edge of N might represent multiple clusters, and a cluster might be
represented by multiple edges of N.
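
The switching-based definition translates directly into a brute-force enumeration. The sketch below is illustrative only: it assumes the network is a list of directed edges without parallel edges, it takes time exponential in the number of reticulations, and the function name and example network are hypothetical.

```python
from itertools import product

def softwired_clusters(edges, leaves):
    """Collect the clusters represented by some edge in some switching.

    edges: list of directed edges (u, v); leaves: set of leaf labels (the taxa X).
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    reticulations = [v for v, ps in parents.items() if len(ps) >= 2]
    tree_edges = [(u, v) for u, v in edges if len(parents[v]) == 1]
    everything = frozenset(leaves)
    clusters = set()
    # A switching keeps exactly one incoming edge for each reticulation.
    for choice in product(*(parents[r] for r in reticulations)):
        kept = tree_edges + list(zip(choice, reticulations))
        children = {}
        for u, v in kept:
            children.setdefault(u, []).append(v)

        def leaf_descendants(v):
            if v in leaves:
                return frozenset([v])
            return frozenset().union(
                *(leaf_descendants(c) for c in children.get(v, [])))

        for _, v in kept:
            cluster = leaf_descendants(v)
            # Clusters are proper subsets of X; skip empty sets below switched-off edges.
            if cluster and cluster != everything:
                clusters.add(cluster)
    return clusters

# Hypothetical network on X = {a, b, c} with one reticulation h above leaf b.
edges = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "h"),
         ("v", "h"), ("v", "c"), ("h", "b")]
result = softwired_clusters(edges, {"a", "b", "c"})
print(sorted("".join(sorted(c)) for c in result))  # ['a', 'ab', 'b', 'bc', 'c']
```

The two switchings of this network give the trees with clades {a, b} and {b, c} respectively, so both clusters are represented in the softwired sense even though no single tree contains both.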

Given a set of clusters C on X, throughout the article we assume that, for any taxon x ∈ X,
C contains at least one cluster C containing x. For a set C of clusters on X we define r(C) as
min{r(N) | N represents C}; we sometimes refer to this as the reticulation number of C. The related
concept of level requires some more background. A directed acyclic graph is connected (also called
"weakly connected") if there is an undirected path (ignoring edge orientations) between each
pair of nodes. A node (edge) of a directed graph is called a cut-node (cut-edge) if its removal
disconnects the graph. A directed graph is biconnected if it contains no cut-nodes. A biconnected
subgraph B of a directed graph G is said to be a biconnected component if there is no biconnected
subgraph B' ≠ B of G that contains B. A phylogenetic network is said to be a level-≤k network
if each biconnected component has reticulation number less than or equal to k. A level-≤k
network is called a simple level-≤k network if the removal of a cut-node or a cut-edge creates two
or more connected components of which at most one is non-trivial (i.e. contains at least one edge).
A (simple) level-≤k network N is called a (simple) level-k network if the maximum reticulation
number among the biconnected components of N is precisely k. For example, the network in Figure
1 is a level-2 network while the one in Figure 2(a) is a simple level-4 network. Note that a tree is
a level-0 network. For a set C of clusters on X we define ℓ(C), the level of C, as the smallest k ≥ 0
such that there exists a level-k network that represents C. It is immediate that for every cluster
set C, r(C) ≥ ℓ(C), because a level-k network always contains at least one biconnected component
containing k reticulations.

We say that two clusters C1, C2 ⊂ X are compatible if either C1 ∩ C2 = ∅ or C1 ⊆ C2 or C2 ⊆ C1.
Consider a set of clusters C. The incompatibility graph IG(C) of C is the undirected graph (V, E)
that has node set V = C and edge set E = {{C1, C2} | C1 and C2 are incompatible clusters in
C}. We say that a set of taxa X' ⊂ X is separated (by C) if there exists a cluster C ∈ C that is
incompatible with X', and unseparated otherwise.
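
As an illustration of these definitions, the following sketch builds the incompatibility graph of a small cluster set and lists its connected components. Clusters are modelled as frozensets of taxon labels; the function names and the example cluster set are hypothetical.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return not (c1 & c2) or c1 <= c2 or c2 <= c1

def incompatibility_graph(clusters):
    """Return IG(C) as an adjacency dict whose nodes are the clusters."""
    graph = {c: set() for c in clusters}
    for c1, c2 in combinations(graph, 2):
        if not compatible(c1, c2):
            graph[c1].add(c2)
            graph[c2].add(c1)
    return graph

def connected_components(graph):
    """Connected components of an undirected graph given as an adjacency dict."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical cluster set: only {a, b} and {b, c} are incompatible.
clusters = [frozenset("ab"), frozenset("bc"), frozenset("abc"), frozenset("de")]
ig = incompatibility_graph(clusters)
components = connected_components(ig)
print(len(components))  # prints 3
```

In this example IG(C) has one non-trivial component, {{a, b}, {b, c}}, and two isolated nodes.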

We say that a set of clusters C on X is separating if it separates all sets of taxa X' such that
X' ⊂ X and |X'| ≥ 2. We say that a set of clusters C on X is tangled [12] if:

1. IG(C) is connected and has more than one node; 

2. every pair of taxa x and y in X is separated by C. 

Remember that here we assume that any taxon of X is contained in at least one cluster C ∈ C.
Then, it can easily be verified that, given a tangled set of clusters C on X, C is separating. 
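
The claim that tangled implies separating can be checked by brute force on tiny inputs. This is a sketch under stated assumptions: `is_separating` enumerates all proper subsets of size at least 2 and is therefore exponential in |X|, and all function names and the example are hypothetical.

```python
from itertools import combinations

def compatible(c1, c2):
    return not (c1 & c2) or c1 <= c2 or c2 <= c1

def is_separated(subset, clusters):
    """A taxa set is separated if some cluster is incompatible with it."""
    return any(not compatible(subset, c) for c in clusters)

def is_separating(clusters, taxa):
    """Brute force over all proper subsets X' of X with |X'| >= 2 (tiny inputs only)."""
    taxa = sorted(taxa)
    return all(is_separated(frozenset(s), clusters)
               for size in range(2, len(taxa))
               for s in combinations(taxa, size))

def ig_connected(clusters):
    """True if IG(C) is connected and has more than one node."""
    clusters = list(clusters)
    if len(clusters) < 2:
        return False
    seen, stack = {clusters[0]}, [clusters[0]]
    while stack:
        c = stack.pop()
        for d in clusters:
            if d not in seen and not compatible(c, d):
                seen.add(d)
                stack.append(d)
    return len(seen) == len(clusters)

def is_tangled(clusters, taxa):
    pairs = combinations(sorted(taxa), 2)
    return ig_connected(clusters) and all(
        is_separated(frozenset(p), clusters) for p in pairs)

# Hypothetical example: a chain of pairwise overlapping clusters.
taxa = {"a", "b", "c", "d"}
clusters = [frozenset("ab"), frozenset("bc"), frozenset("cd")]
print(is_tangled(clusters, taxa), is_separating(clusters, taxa))  # True True
```

By contrast, the set {{a, b}, {c, d}} on the same taxa is not tangled, because its incompatibility graph has no edges.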

The incompatibility graph and the concept of tangled clusters are important because they
highlight an important difference between (the computation of) r(C) and ℓ(C). In [27] the authors
show that, if ℓ(C) = k, then a level-k network that represents C can be constructed by combining
in polynomial time simple level-≤k networks constructed independently for each connected
component of IG(C). The actual procedure is slightly more involved but it shows in any case that a
polynomial-time algorithm for constructing simple level-≤k networks can easily be extended to
a polynomial-time algorithm for constructing level-≤k networks. We will make use of this fact in
Section 3.

Unfortunately, as has been observed by several authors, the same procedure does not necessarily 
lead to networks that have reticulation number r(C). In other words, computation of r(C) requires 
something more complicated than independently optimizing each connected component of IG(C). 
An important special case, however, is when C is separating; in this case any network N that
represents C is simple (or can be trivially modified to become simple) and r(C) = ℓ(C). We will
formalize this in due course. 

To conclude the preliminaries we note that, throughout the article, we often write that an 
algorithm is "polynomial time" without formally specifying what the input size is. Unless otherwise 
specified the input is a set of clusters C on taxa set X. It is sufficient to take |C| + |X| as a lower
bound on the size of the input. In some cases (such as Lemma 3 in Section 3) |C| is at most a
constant factor larger than |X| and then it is sufficient to prove a running time polynomial in |X|.
In other cases a running time of the form O(|C|^a |X|^b) is obtained, for constants a and b, and this
is clearly polynomial in |C| + |X| because (|C| + |X|)^2 > |C||X|.

2.1 Structure of the article 

To facilitate the mathematical exposition we build the results of this article up in a specific order, 
which differs from the order presented in the introduction. We begin with Section 3, A theoretical
polynomial-time algorithm for constructing level-k networks, where we prove that, for every fixed
k > 0, the problem of determining whether a level-k network that represents C exists (and if so to
construct such a network) is solvable in polynomial time. This section is relatively self-contained
compared to the rest of the article. In Section 4, From theory to practice: the importance of
ST-sets, we describe several fundamental properties of the ST-set, a special structure that plays a
central role throughout the rest of the article. In Section 5, Clusters obtained from sets of binary
trees on X, we show how, given a set T of binary trees on X, and for each fixed r > 0, it is possible
to construct in polynomial time all binary phylogenetic networks with reticulation number r that
represent all the clusters in the input trees. We also describe Clustistic, which is our
implementation of this algorithm built on top of already-existing software. In Section 6, Witnesses and a
natural lower bound, we further develop the theory surrounding the ST-set, explicitly relating it to
the computation of reticulation number. This is used extensively in Section 7, The optimality and
non-optimality of Cass, where we give both positive and negative results for the Cass algorithm,
and in the process develop a number of powerful generalizations of unification results from [25].

3 A theoretical polynomial-time algorithm for constructing level-k networks

In this section we prove that, for every fixed k > 0, the problem of determining whether a level-k
network exists that represents C, and if so to construct such a network, can be solved in polynomial
time.

We first require some auxiliary lemmas and definitions. The proofs are rather technical so we
defer them to the appendix. For a node v let X(v) ⊆ X be the set of all taxa reachable from v by
directed paths. For an edge e = (u, v) we define X(e) to be equal to X(v).

Observation 1. Let C be a separating set of clusters on X. Let N be any network that represents
C. Then each node of N has at most one leaf child and for each cut-edge (u, v) in N, |X(v)| = 1
or X(v) = X.

Proof. Deferred to the appendix. □ 

Lemma 1. Let C be a separating set of clusters on X. Let N be any network that represents C.
Then there exists a simple network N* with at most one leaf-child per node such that ℓ(N*) ≤ ℓ(N).

Proof. Deferred to the appendix. □ 

Observation 1 and Lemma 1 formalize the idea that any network that represents a separating
set of clusters is simple or can easily be made simple by deleting certain redundant parts of it. The 
following lemma shows that, in terms of minimizing reticulation number or level, we can assume 
without loss of generality that networks are binary. 

Fig. 3. The single level-1 generator and the four level-2 generators. Here the sides have been
labelled with capital letters.

Fig. 4. An example of the execution of the algorithm outlined in Lemma 3 for the separating
set of clusters C = {{a, b}, {a, c}, {c, d}, {d, e}, {a, b, c}, {c, d, e}, {d, e, f}, {c, d, e, f}, {b, d, e, f},
{b, c, d, e, f}, {a}, {b}, {c}, {d}, {e}, {f}} on X = {a, b, c, d, e, f}. The value of k is fixed to 2. (a)
The chosen generator g is the generator 2c in Figure 3. (b) We guess that the sides E, H and
G contain one leaf while the side F contains more than 2 leaves and all other sides contain zero
leaves. (c) We guess the single leaves on sides E, H and G. (d) We guess the leaves s⁺ and s⁻ on
side F. (e) We deduce the last leaf in F and we obtain a simple level-2 network on X. By a stroke
of luck, our first guess represents C.



Lemma 2. Let N be a phylogenetic network on X. Then we can transform N into a binary 
phylogenetic network N' such that N' has the same reticulation number and level as N and all 
clusters represented by N are also represented by N' . 

Proof. Deferred to the appendix. □ 

Note that, given a simple binary network N, each node of N has at most one leaf child. Armed 
with these technical results we are ready to prove the main result of this section. 

Lemma 3. Let C be a separating set of clusters on X. Then, for every fixed k > 0, it is possible
to determine in polynomial time whether a level-k network exists that represents C, and if so to 
construct such a network. 

Proof. From Lemmas 1 and 2 it is sufficient to focus on simple binary networks. We assume then
that, for fixed k, there exists a binary simple level-k network N that represents C. Let |X| = n.
Then C will contain at most 2^(k+1) (n − 1) clusters, because there are at most 2^k trees displayed
by a simple level-k network, and each tree represents at most 2(n − 1) clusters. Thus, for fixed k,
the size of the input is polynomial in n. It follows from these observations that whether a set of
clusters is represented by a given simple level-k network can be checked in polynomial time.

It is known that, if the leaves of N are removed and all nodes with both indegree and outdegree
equal to 1 are suppressed, the resulting structure will be a level-k generator, defined in [24]. See
also Figure 3. For fixed k, there are only a constant number of level-k generators Proposition
2.5]. Recall that the sides of a level-k generator are defined as the union of its edges and its nodes
of indegree 2 and outdegree 0. For fixed k the maximum number of sides, ranging over all level-k
generators, is a constant.

For a cluster set C on X, we write x → y if and only if every non-singleton cluster in C that
contains x, also contains y. For example, for the cluster set of Figure 4 we have e → d and f → e.
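
The relation → is straightforward to compute. The sketch below evaluates it on the cluster set given in the caption of Fig. 4; the function name `arrow` and the encoding of clusters as frozensets are assumptions made for illustration.

```python
def arrow(clusters, x, y):
    """True if every non-singleton cluster containing x also contains y (x -> y)."""
    return all(y in c for c in clusters if len(c) > 1 and x in c)

# Cluster set from the caption of Fig. 4, on X = {a, b, c, d, e, f}.
FIG4_CLUSTERS = [frozenset(s) for s in (
    "ab", "ac", "cd", "de", "abc", "cde", "def", "cdef", "bdef", "bcdef",
    "a", "b", "c", "d", "e", "f")]

print(arrow(FIG4_CLUSTERS, "e", "d"))  # True
print(arrow(FIG4_CLUSTERS, "f", "e"))  # True
print(arrow(FIG4_CLUSTERS, "d", "e"))  # False, because {c, d} contains d but not e
```

Note that the relation is not symmetric: e → d holds while d → e fails.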

In the remainder of the proof, we illustrate a simple algorithm for determining whether a binary
simple level-k network that represents C exists by attempting to reconstruct such a network. Let
g be the generator underlying N. We only require polynomially many tries to compute g, because
there are only a constant number of generators. So assume we know g. For each side of g, we guess
whether there are 0, 1, 2 or more than 2 leaves on that side. For each side containing exactly one
leaf, we guess what that leaf is. For each side s of g containing 2 or more leaves, we guess the leaf s⁺
that is nearest to the root on that side, and the leaf s⁻ that is furthest from the root on that side.
For an example see Figure 4. Note that, since for each side we have only four options (0, 1, 2 or
more than 2 leaves), and in the latter case only s⁺ and s⁻ have to be chosen, it follows that we
have a polynomial number of guesses to try.
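
A rough count makes the polynomial bound explicit. The symbols G, S and n are introduced here for illustration and are not the article's notation: let n = |X|, let G be the number of level-k generators and S the maximum number of sides of any of them, both constants for fixed k. A side with no leaves needs no further choice, a side with one leaf has at most n candidates, and a side with two leaves or with more than two leaves needs an ordered pair (s⁺, s⁻), giving at most n(n − 1) candidates in each of those two cases. Ignoring the fact that leaves on different sides must be distinct, which only makes the bound looser, we obtain:

```latex
\[
  \#\text{guesses}
  \;\le\; G \cdot \bigl(1 + n + 2n(n-1)\bigr)^{S}
  \;\le\; G \cdot \bigl(2n^{2}\bigr)^{S}
  \;=\; O\!\left(n^{2S}\right)
  \qquad (n \ge 1).
\]
```

The middle inequality uses 1 + n + 2n(n − 1) = 2n² − n + 1 ≤ 2n² for n ≥ 1.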

We will now show how to add the remaining leaves. Note that we may fail to insert all leaves 
in the network. This means that we made the wrong guess and that another set of guesses has 
to be checked. We say that a side s is lowest if it does not yet have all its leaves, and there is 
no other such side s' reachable from s. By reachable we mean that in the underlying generator g, 
there is a directed path from the head of side s to the tail of side s' . Since N is a directed acyclic 
graph, until all leaves in X have been added, there will always be a lowest side. For example, the 
side F in Figure 4(d) is lowest. The idea is to add leaves to the lowest side s, until all its leaves
have been added. We then continue with remaining lowest sides until we have reconstructed N.

Given a lowest side s, with s⁺ and s⁻ fixed, it is possible to tell in polynomial time what the
correct remaining leaves for s are, as follows. Observe that a leaf x that is on side s in N and which
has not yet been added has the property s⁺ → x → s⁻. Furthermore, there is at least one cluster
C ∈ C such that {x, s⁺, s⁻} ∩ C = {x, s⁻}. There exists at least one such cluster because otherwise
{x, s⁺} would not be separated by C, a contradiction since C is separating. We call such a cluster
a split cluster for side s. Now, observe that for every split cluster C for side s, and for every side
t ≠ s that contains 2 or more leaves in N, either {t⁺, t⁻} ∩ C = {t⁺, t⁻} or {t⁺, t⁻} ∩ C = ∅. This
follows because the only edges in N that represent C lie on side s. If this is not the case, our set
of guesses was incorrect and a new one has to be checked.

Now, consider any leaf y that has not yet been added to the network. Assume that this leaf
belongs to side t for some t. We want to have a simple test to avoid wrongly placing it on side s,
with s ≠ t. Side t will contain three or more leaves in N, so we can assume that t⁺ and t⁻ exist. If
s⁺ → y → s⁻ does not hold then it is immediately clear that y cannot be put on side s. So assume
(conversely) that this condition does hold, and for the same reason assume there is a split cluster C
for side s that contains y. In other words, there is a cluster C such that {y, s⁺, s⁻} ∩ C = {y, s⁻}.
Since t⁺ → y → t⁻ holds, it follows that C also contains t⁺ and t⁻, because any cluster that
contains y also contains t⁻, and we know that C contains either both of t⁺ and t⁻, or neither of
them. However, there is no edge in N that can represent C: the only edges that represent C lie on
side s, but the fact that s is the lowest side means that no cluster beginning on side s can contain
any leaves on side t. To summarize, we have a simple test for determining whether a leaf should
be placed on side s. Once we have determined the set of leaves that should be placed on side s,
it is easy to determine the correct order of those leaves by inspecting → relationships. Indeed,
if q and p are two leaves that belong to side s, and q is nearer to s⁻ in the network we aim to
reconstruct, then obviously p → q. Since C is separating there will be (by separation) some cluster
that contains q but not p, so q ↛ p. If all leaves can be added in such a way, we obtain a simple
level-k network on X and we can check in polynomial time if it represents C. If it is not the case
or we fail to insert at least one leaf in the network, another set of guesses can be checked until a
simple level-k network representing C (if any exist) is found. This concludes the proof. □
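
The final ordering step of the proof can be sketched in a few lines. Since p → q holds exactly when q is at or below p on the side, a leaf nearer the root has arrows to more of the side's leaves, so sorting by that count recovers the order from s⁺ down to s⁻. This is an illustrative sketch, not the article's implementation: the function names are hypothetical, and the choice of {d, e, f} as the leaves of one side of the Fig. 4 example is an assumption made only to have concrete input.

```python
def arrow(clusters, x, y):
    """True if every non-singleton cluster containing x also contains y (x -> y)."""
    return all(y in c for c in clusters if len(c) > 1 and x in c)

def order_side(side_leaves, clusters):
    """Order the leaves of one side from nearest the root (s+) to furthest (s-).

    Assumes the leaves really do form a chain under ->, as in the proof of Lemma 3.
    """
    def reach(x):
        # Number of leaves on this side that x has an arrow to (including x itself).
        return sum(arrow(clusters, x, z) for z in side_leaves)
    return sorted(side_leaves, key=reach, reverse=True)

# Cluster set from the caption of Fig. 4; the side {d, e, f} is an assumed input.
FIG4_CLUSTERS = [frozenset(s) for s in (
    "ab", "ac", "cd", "de", "abc", "cde", "def", "cdef", "bdef", "bcdef",
    "a", "b", "c", "d", "e", "f")]

print(order_side({"d", "e", "f"}, FIG4_CLUSTERS))  # ['f', 'e', 'd']
```

On this input f → e → d holds, so f plays the role of s⁺ and d the role of s⁻.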

The following corollary follows automatically from the generality of the proof of Lemma 3.

Corollary 1. Let C be a separating set of clusters on X . Then, for every fixed k > 0, it is possible 
to construct in polynomial time all binary simple level-k networks that represent C . 

Fig. 5. A phylogenetic tree T (a) and a phylogenetic network N (b,c,d); (b) illustrates in red
that N displays T (deleted edges are dashed); (c) illustrates that N is consistent with (amongst
others) the triplet cd|f (deleted edges are again dashed); (d) illustrates that N represents (amongst
others) cluster {c,d,e} in the softwired sense (dashed reticulation edges are "switched off").

Using Lemma [3] we can prove the following result: 

Theorem 1. Let C be a (not necessarily separating) set of clusters on X. Then, for every fixed 
k > 0, it is possible to determine in polynomial time whether a level-k network exists that represents 
C, and if so to construct such a network. 

Proof. Recall that all tangled cluster sets are separating. It was shown in [27] that the existence
of a polynomial-time algorithm for constructing a level-≤k network from a tangled cluster set
is sufficient to give a polynomial-time algorithm for constructing level-k networks from general
cluster sets. (Specifically, several tangled cluster sets are obtained by processing each non-trivial
connected component of the incompatibility graph of the original cluster set [27,12].) Hence we
can assume without loss of generality that C is tangled. Lemma [3] is thus sufficient, and we are
done.

□ 

3.1 Rooted triplets 

It is interesting to note that the proof technique used in Lemma [3] leads to a simplified proof,
presented in the following corollary, of a complexity result that was first proven in [28]. (The
algorithm in [28] yielded a much faster running time, however.) Let us first recall several definitions
related to rooted triplets. A (rooted) triplet on X is a binary phylogenetic tree on a size-3 subset
of X. We use xy|z to denote the triplet with taxa x, y on one side of the root and z on the other
side of the root. For triplets, the notion of "represent" can be formalized by the notion of "display"
introduced above. However, for triplets "consistent with" is often used instead of "displayed by".
A triplet xy|z is consistent with a phylogenetic network N (and N is consistent with xy|z) if xy|z
is displayed by N. See Figure [5] for an example. Given a phylogenetic tree T on X, we let Tr(T)
denote the set of all rooted triplets on X that are consistent with T. For a set of phylogenetic
trees 𝒯, we let Tr(𝒯) denote the set of all rooted triplets that are consistent with some tree in 𝒯,
i.e. Tr(𝒯) = ∪_{T ∈ 𝒯} Tr(T). A set of triplets on X is dense if, for every size-3 subset {x, y, z} ⊆ X,
at least one of xy|z, xz|y, yz|x is in the triplet set.
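The definitions of Tr(T) and density can be made concrete with a small sketch. It is an illustration only, not code from the article: trees are assumed to be encoded as nested Python tuples with leaf labels at the tips, the function names are hypothetical, and a triplet xy|z is stored as the tuple (x, y, z). It uses the standard fact that xy|z is consistent with a rooted tree exactly when some node has x and y, but not z, among its leaf descendants.

```python
from itertools import combinations

def leaf_set(tree):
    """Return the set of leaf labels below a node (a leaf is any non-tuple)."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clusters(tree):
    """Return the leaf sets of all nodes of the tree (the root included)."""
    result = {leaf_set(tree)}
    if isinstance(tree, tuple):
        for child in tree:
            result |= clusters(child)
    return result

def triplets(tree):
    """Tr(T): xy|z is consistent with T iff some node has x, y below it but not z."""
    taxa = leaf_set(tree)
    found = set()
    for cluster in clusters(tree):
        for x, y in combinations(sorted(cluster), 2):
            for z in taxa - cluster:
                found.add((x, y, z))  # encodes xy|z with x < y
    return found

def is_dense(triplet_set, taxa):
    """Check that every size-3 subset of taxa carries at least one triplet."""
    covered = {frozenset(t) for t in triplet_set}
    return all(frozenset(s) in covered for s in combinations(taxa, 3))

binary_tree = (("a", "b"), ("c", "d"))
star_tree = ("a", "b", "c")
print(sorted(triplets(binary_tree)))
print(is_dense(triplets(binary_tree), "abcd"), is_dense(triplets(star_tree), "abc"))
```

The triplet set of a binary tree is always dense, whereas an unresolved (star) node contributes no triplet for the taxa it leaves unresolved.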

Corollary 2. Let R be a dense set of triplets on X . Then, for every fixed k > 0, it is possible 
to determine in polynomial time whether a binary simple level-k network exists that is consistent 
with R, and if so to construct such a network. 

Proof. As pointed out in [28], it is possible to determine in polynomial time whether a given
network is indeed consistent with a set of input triplets. Then, the proof of Lemma [3] holds here
almost entirely. The only significant difference concerns the adding of leaves to the lowest side: a
not yet allocated leaf x belongs on lowest side s if and only if the triplet s^- x | s^+ is in the input. □

We shall return to rooted triplets again later in the article. 



4 From theory to practice: the importance of ST-sets



The algorithm described in Section [3] runs in polynomial time but is only of theoretical interest because
its running time is too high to be useful in practice. In the rest of this article we will focus on
practical polynomial-time algorithms. In all these algorithms ST-sets, which can informally
be thought of as treelike subsets of X, have a central role. We begin by formally defining ST-sets
and describing their basic properties. We will expand upon these basic properties in subsequent
sections of the article.

4.1 Definition and basic properties of ST-sets 

Given a set S ⊆ X of taxa, we use C \ S to denote the result of removing all elements of S from each
cluster in C, and we use C|S to denote C \ (X \ S) (the restriction of C to S). We say that a set S ⊆ X
is an ST-set with respect to C if S is not separated by C and any two clusters C_1, C_2 ∈ C|S are
compatible. Note that, unlike in [57], we allow the possibility that S = ∅ or S = X. (We say
that an ST-set S is trivial if S = ∅ or S = X.) An ST-set S is maximal if there is no ST-set T
with S ⊂ T.
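The definition can be checked mechanically. The following is a minimal sketch, assuming the notions of compatibility and separation used elsewhere in the article: two clusters are compatible if they are disjoint or nested, and S is separated by C if some cluster of C is incompatible with S (compare the first step of the proof of Lemma 6). The function names are hypothetical and clusters are assumed to be given as frozensets.

```python
def compatible(a, b):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return not (a & b) or a <= b or b <= a

def restrict(clusters, s):
    """C|S: intersect every cluster with S."""
    return {c & s for c in clusters}

def is_st_set(clusters, s):
    """S is an ST-set if no cluster is incompatible with S (S is unseparated)
    and the clusters of C|S are pairwise compatible."""
    s = frozenset(s)
    unseparated = all(compatible(c, s) for c in clusters)
    restricted = restrict(clusters, s)
    laminar = all(compatible(a, b) for a in restricted for b in restricted)
    return unseparated and laminar

# Illustrative cluster set on X = {a, b, c, d, e}.
C = [frozenset("ab"), frozenset("bc"), frozenset("de"), frozenset("abc")]
print(is_st_set(C, "de"))   # True: unseparated, and C|S is compatible
print(is_st_set(C, "ab"))   # False: separated by the cluster {b, c}
print(is_st_set(C, "abc"))  # False: unseparated, but C|S contains {a,b} and {b,c}
```

The last two calls show that the two conditions are independent: a set can fail because it is separated, or because its restriction is not treelike.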

Informally, the maximal ST-sets are the result of repeatedly collapsing pairs of unseparated 
taxa for as long as possible; we can think of them as "islands of laminarity" within the cluster set. 
ST-sets first explicitly appeared in [57] but, as we shall see in due course, they implicitly arose
earlier in the recombination network literature. An important feature of ST-sets is that there can
in general be very many of them. For example, suppose C contains only |X| = n singleton clusters;
then C has 2^n ST-sets. However, as the following technical results show, C will have at most n
maximal ST-sets, and they will partition X, i.e. they are mutually disjoint and entirely cover X.
Several of the proofs have been deferred to the appendix. 
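For very small inputs these counts can be confirmed by exhaustive enumeration. The sketch below is deliberately naive (it inspects all 2^n subsets, so it is not the polynomial-time procedure of Lemma 5), the helper names are hypothetical, and it assumes the usual meaning of compatible (disjoint or nested) and unseparated (no cluster incompatible with S).

```python
from itertools import combinations

def compatible(a, b):
    """Clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a

def is_st_set(clusters, s):
    """Unseparated by the clusters, and pairwise compatible after restriction to s."""
    restricted = {c & s for c in clusters}
    return (all(compatible(c, s) for c in clusters)
            and all(compatible(a, b) for a in restricted for b in restricted))

def all_st_sets(clusters, taxa):
    """Enumerate every ST-set by brute force over all subsets of the taxa."""
    taxa = sorted(taxa)
    subsets = (frozenset(c) for k in range(len(taxa) + 1)
               for c in combinations(taxa, k))
    return [s for s in subsets if is_st_set(clusters, s)]

def maximal_st_sets(clusters, taxa):
    """Keep the ST-sets that are not strictly contained in another ST-set."""
    st_sets = all_st_sets(clusters, taxa)
    return [s for s in st_sets if not any(s < t for t in st_sets)]

singletons = [frozenset(x) for x in "abcd"]
print(len(all_st_sets(singletons, "abcd")))   # 16, i.e. 2^4
print(maximal_st_sets(singletons, "abcd"))    # the single maximal ST-set is X

C = [frozenset("ab"), frozenset("bc"), frozenset("de"), frozenset("abc")]
print(maximal_st_sets(C, "abcde"))            # {a}, {b}, {c}, {d, e}: a partition of X
```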

Lemma 4. Let C be a set of clusters on X and let S_1 ≠ S_2 be two ST-sets of C. If S_1 ∩ S_2 ≠ ∅
then S_1 ∪ S_2 is an ST-set.

Proof. Deferred to the appendix. □

Corollary 3. Let C be a set of clusters on X and let S_1 ≠ S_2 be two maximal ST-sets of C. Then
S_1 ∩ S_2 = ∅.

Corollary 4. Let C be a set of clusters on X. Then there are at most n maximal ST-sets with 
respect to C, they are uniquely defined and they partition X . 

Proof. The disjointness of maximal ST-sets guarantees that there are at most n of them. Consider
the set 𝒮 of all (necessarily disjoint) maximal ST-sets of X. Suppose the maximal ST-sets in 𝒮
do not entirely cover X. Then there is some x ∈ X which is not contained in any maximal ST-set in
𝒮. Let S be the ST-set of largest cardinality that contains x; such an ST-set must exist because
{x} is an ST-set. S ∉ 𝒮, so there exists some ST-set S' such that S ⊂ S', but this contradicts the
assumption on the cardinality of S. Hence 𝒮 partitions X, and by extension 𝒮 is unique. □

Lemma 5. The maximal ST-sets of a set of clusters C on X can be computed in polynomial time. 

Proof. Deferred to the appendix. □ 

Corollary [4] and Lemma [5] are perhaps not so surprising, but we have nevertheless proven them 
rigorously to highlight the fact that computing maximal ST-sets is not a complexity bottleneck. 
In later sections we shall see that there is a link between NP-hardness and maximal ST-sets, but 
that the hardness lies in selecting certain maximal ST-sets with special properties, not in the 
computation of the maximal ST-sets per se. 

Let 𝒯 be a set of trees, where each T ∈ 𝒯 is a tree on X. For a tree T we write Cl(T) to denote
the set of clusters induced by edges of T, i.e. C ∈ Cl(T) if and only if some edge of T represents
C. We let Cl(𝒯) = ∪_{T ∈ 𝒯} Cl(T). Whenever we (reasonably) assume that all singleton clusters are
present in the input (see footnote 5), it is easy to see that every cluster set C on X can be written as Cl(𝒯) for
some 𝒯 as follows. We take any proper coloring of IG(C) (i.e. map the nodes of IG(C) to colors
such that no two adjacent nodes have the same color) and use the resulting colors to partition C.
Clusters that have been colored the same are all mutually compatible, so can be represented by
a single tree corresponding to that color. Finally, whenever a subset of clusters pertaining to a
color does not cover all elements of X, the missing taxa can be attached to the root. An obvious
corollary of this is that the chromatic number of IG(C) is a lower bound on the cardinality of 𝒯.
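The coloring argument can be sketched in a few lines. The example below uses a simple greedy coloring, which is one way (an assumption made here for illustration, not a choice prescribed by the article) of obtaining a proper coloring of IG(C); it does not attempt to minimize the number of colors. The function names are hypothetical, and two clusters are taken to be adjacent in IG(C) when they are incompatible.

```python
def compatible(a, b):
    """Clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a

def color_clusters(clusters):
    """Greedily color IG(C): each cluster joins the first color class whose
    members are all compatible with it, or opens a new class."""
    classes = []
    for c in clusters:
        for cls in classes:
            if all(compatible(c, other) for other in cls):
                cls.append(c)
                break
        else:
            classes.append([c])
    return classes

C = [frozenset(s) for s in ("ab", "bc", "cd", "a", "b", "c", "d")]
for index, cls in enumerate(color_clusters(C)):
    # every class is pairwise compatible, so it can be represented by one tree
    print(index, [sorted(c) for c in cls])
```

Each color class is pairwise compatible by construction, so it corresponds to one tree of the resulting tree set; a greedy coloring may use more classes than the chromatic number.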

Whenever C = Cl(T) there is an important relationship between the nodes and edges of trees 
in T, and the (maximal) ST-sets of C. Let T be a (not necessarily binary) tree on X. In Section [2] 
we defined when an edge of a tree represents a cluster. Here we extend this definition to nodes of 
trees. We say that a node v of T represents C if C is equal to the union of the clusters represented 
by some (not necessarily strict) subset of its outgoing edges. Note that if an edge (u, v) represents
a cluster C then so does v.

Lemma 6. Let 𝒯 = {T_1, ..., T_m} be a set of trees on X. Let ∅ ⊂ X' ⊂ X be an unseparated set
with respect to Cl(𝒯). Then for each T_i ∈ 𝒯 there exists an edge e_i or a node v_i in T_i such that
e_i or v_i represents X'.

Proof. Consider an arbitrary tree T_i ∈ 𝒯. Every cluster in Cl(T_i) is either disjoint from X', a
superset of it, or a subset of it, otherwise X' would be separated. If some edge of T_i represents
X' then we are done. Otherwise consider some node v furthest from the root which represents a
cluster C such that X' ⊆ C. Such a v must exist because if necessary we can take the root as
v. C is equal to the union of the clusters represented by some subset of the edges outgoing from
v. Each cluster represented by an outgoing edge of v is either disjoint from X' or a subset of it,
because of the assumption on the distance of v from the root. For the same reason, X' intersects
with at least two such outgoing edge clusters of v. But when X' intersects with such a cluster it
must contain it entirely, so X' is equal to the union of some subset of the clusters represented by
edges outgoing from v. Hence v represents X'. □

The following corollary is automatic. 

Corollary 5. Let 𝒯 = {T_1, ..., T_m} be a set of binary trees on X. Let ∅ ⊂ X' ⊂ X be an
unseparated set with respect to Cl(𝒯). Then for each T_i ∈ 𝒯 there exists an edge e_i such that e_i
represents X'.

Note that Lemma [6] and Corollary [5] hold in particular for (maximal) ST-sets, because all
ST-sets are unseparated. Hence we obtain the following two straightforward extensions to ST-sets, which we
will use extensively in the next section. Recall that an ST-set S is trivial if S = ∅ or S = X.

Corollary 6. Let 𝒯 = {T_1, ..., T_m} be a set of binary trees on X. Let S be a non-trivial ST-set
with respect to Cl(𝒯). Then for each T_i ∈ 𝒯 there exists an edge e_i in T_i such that e_i represents
S.

Corollary 7. Let 𝒯 = {T_1, ..., T_m} be a set of binary trees on X. Then Cl(𝒯) contains at most
2(n - 1) non-trivial ST-sets, and for every such ST-set S of Cl(𝒯) and every tree T_i ∈ 𝒯 there
exists a unique edge e_i of T_i such that X(e_i) = S and such that the subtree rooted at the head of
e_i is the unique tree that represents exactly the cluster set Cl(𝒯)|S.

Proof. Any tree T_i ∈ 𝒯 has exactly 2(n - 1) edges, and consequently Cl(T_i) contains 2(n - 1)
clusters. Since an ST-set is by definition unseparated, each non-trivial ST-set of Cl(𝒯) is (by
Corollary [6]) an element of the cluster set Cl(T_i) (for any i) and there are thus at most 2(n - 1)
such ST-sets. Moreover, by definition,
for any ST-set S of Cl(𝒯) we have that the set Cl(𝒯)|S is compatible. Since two non-isomorphic
binary trees on the same taxa set induce at least two incompatible clusters, this concludes the
proof. □

5 The presence or absence of the singleton clusters in the input does not change the complexity of the 
problems we study because it is trivial to modify a network without raising its reticulation number or 
level such that it also represents all the singleton clusters. 



Informally, Corollary [7] states that when 𝒯 consists solely of binary trees, each non-trivial ST-set
corresponds to some subtree that is common to all trees in 𝒯.



5 Clusters obtained from sets of binary trees on X 

Let 𝒯 be a set of binary trees on X. In this section we prove that, for fixed r > 0, it is
possible to construct in polynomial time all binary phylogenetic networks with reticulation number
r that represent Cl(𝒯). We also describe our new program Clustistic which implements this
algorithm. Clustistic is in itself an important "proof of concept": it has been rapidly prototyped
by slightly modifying, building on insights from [25], existing software that was originally conceived
to reconstruct binary level-k networks not from clusters but from rooted triplets.

Let N be a network on X and let T' be some tree on X' ⊆ X. We say that T' is a Subtree
Below a Reticulation (SBR) of N if there is a reticulation node v in N such that no reticulation
nodes v' ≠ v are reachable from v by a directed path, and that the subnetwork rooted at v (or,
when v has outdegree exactly 1, the child of v) is exactly equal to T'. It is easy to show that (by
virtue of its acyclicity) every network contains at least one SBR [28]. A simple though critical
observation is:

Observation 2. If T' is an SBR of a network N on X that represents a cluster set C, and T' has
taxa set X', then X' is an ST-set with respect to C.

We will make repeated use of this throughout the rest of the article. (Note however that in 
general an ST-set will not specify a unique SBR). 

Given a network N with an SBR T we denote by N \ T the network obtained from N by deleting
T and for as long as necessary applying the following tidying-up operations until they are no longer
needed: deleting any node with outdegree zero that is not labelled by an element of X; suppressing
all nodes with indegree and outdegree both equal to 1; replacing multi-edges with single edges;
deleting nodes with indegree 0 and outdegree 1.

We call (S_1, S_2, ..., S_p) (p ≥ 0) a (maximal) ST-set sequence of C if S_1 is a (maximal) ST-set
of C, S_2 is a (maximal) ST-set of C \ S_1, S_3 is a (maximal) ST-set of C \ S_1 \ S_2 and so on. Such a
sequence is additionally a tree sequence if all the clusters in C \ S_1 \ ... \ S_p are mutually compatible,
i.e. can be represented by a tree. We denote by p the length of the sequence; p = 0 denotes the
empty tree sequence.
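Checking whether a given sequence is an ST-set tree sequence is a direct loop over the definition. The sketch below is illustrative only: function names are hypothetical, compatibility is taken to mean disjoint or nested, and a set is taken to be separated when some cluster is incompatible with it. Clusters that become empty after a removal are simply dropped.

```python
def compatible(a, b):
    """Clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a

def is_st_set(clusters, s):
    """Unseparated by the clusters, and pairwise compatible after restriction to s."""
    restricted = {c & s for c in clusters}
    return (all(compatible(c, s) for c in clusters)
            and all(compatible(a, b) for a in restricted for b in restricted))

def remove_taxa(clusters, s):
    """Delete the taxa of s from every cluster, dropping emptied clusters."""
    return {c - s for c in clusters if c - s}

def is_st_set_tree_sequence(clusters, sequence):
    """Check that each S_i is an ST-set of what remains after removing
    S_1, ..., S_(i-1), and that the final remainder is pairwise compatible."""
    current = {frozenset(c) for c in clusters}
    for s in sequence:
        s = frozenset(s)
        if not is_st_set(current, s):
            return False
        current = remove_taxa(current, s)
    return all(compatible(a, b) for a in current for b in current)

C = [{"r", "x1"}, {"r", "x2"}, {"r", "x3"}]
print(is_st_set_tree_sequence(C, [{"r"}]))           # True, length 1
print(is_st_set_tree_sequence(C, []))                # False, C itself is not compatible
print(is_st_set_tree_sequence(C, [{"x1"}, {"x2"}]))  # True, length 2
```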

Lemma 7. Let N be a network that represents some cluster set C on X. Then there exists a
sequence of SBRs that need to be removed to prune N into a tree, and this corresponds to an
ST-set tree sequence of C of length r(N).

Proof. If N is a tree then there is an empty tree sequence. Otherwise, N has an SBR T' with
taxa set X' and the network N' = N \ T' represents the cluster set C \ X'. Clearly r(N') < r(N). By
Observation [2], X' is an ST-set, so we let S_1 equal X' and let d = r(N) - r(N'). If d > 1 then we let S_2, ..., S_d all equal
∅, the empty ST-set. We use these empty ST-sets to model the situation when removing the SBR
would cause multiple reticulation nodes to disappear simultaneously. Now, N' also has at least
one SBR, so we can iterate the whole process. We can repeat this until we obtain a tree: at this
point we will have an ST-set tree sequence of length exactly r(N). □

In the following observation we make use of the fact that the operation N \ T is also meaningfully
defined when the tree T is exactly the subtree rooted at some node of N.

Observation 3. Let 𝒯 = {T_1, ..., T_m} be a set of binary trees on X. Let S be a non-trivial ST-set
of Cl(𝒯). Then Cl(𝒯) \ S = Cl(𝒯') where 𝒯' is a set of at most m binary trees {T'_1, ..., T'_m} on
X \ S with T'_i = T_i \ T_v, where e_i = (u, v) is the edge of T_i that represents S (which exists by
Corollary [6]) and T_v is the subtree rooted at v.

Fig. 6. Let C be some set of clusters represented by the network on the right. If we guess that
ST-set S = {h,i,j} corresponds to an SBR and remove it, we obtain the tree on the left. To
reverse the process we add (a tree corresponding to) S below a reticulation node whose incoming
edges subdivide edges x and y.

Fig. 7. The different ways of adding a reticulation back into a network, as discussed in the proof
of Theorem [2]: (a) two different edges are subdivided; (b) one edge is subdivided twice; (c) one
edge is subdivided once, under which a multi-edge is placed.



Theorem 2. Let 𝒯 = {T_1, ..., T_m} be a set of binary trees on X. Then for a constant r > 0 it
is possible to construct in polynomial time all binary networks with reticulation number at most r
that represent Cl(𝒯) (if any exist).

Proof. Without loss of generality assume we wish to construct all such networks with reticulation
number exactly r. Let us suppose that at least one such network N exists. By Lemma [7] there is
an ST-set tree sequence S = (S_1, ..., S_r) for Cl(𝒯) and this corresponds to a sequence of SBRs
that, when removed, will prune N into a tree. Let |X| = n. Now, note that by Observation [3]
and Corollary [7] there are at most O(n^r) ST-set tree sequences, which is polynomial in n for fixed
constant r. S will be one of these, so we can find S in polynomial time. It remains to show how,
assuming we have found S, we can reconstruct N. First, note that for a binary tree T on X,
T is the unique tree on X that represents Cl(T). The clusters that still remain after removing
S_r can be represented by a tree, and in particular (by Observation [3]) by a unique binary tree.
We call this tree N_r. We want to obtain N_{r-1} by inserting (a tree corresponding to) S_r into N_r.
In particular, we wish to introduce a new reticulation node into N_r below which a tree T (itself
binary and unique by Corollary [7]) that represents S_r will be attached. We have two possibilities
to do this: we subdivide two (not necessarily distinct) edges and use these as the tails of the new
reticulation edges, see Figure 7(a) and (b), or we subdivide one edge once and use two identical
edges, i.e. a multi-edge, to enter the new reticulation, see Figure 7(c). Note that, if we only
subdivide one edge once, we actually
create a multi-edge and by our definition of phylogenetic network such edges are not allowed.
However it might be necessary to create such multi-edges during intermediate iterations to ensure
that all phylogenetic networks, including "redundant" ones, are constructed.

Unfortunately we do not know in general which edge(s) of N_r to subdivide to create the new
connection point(s). However, there are only polynomially many edges in N_r, so we simply try
them all. Then we repeat the process, inserting S_{r-1} into N_{r-1} to obtain N_{r-2}, and so on until we
have obtained N_0. Given that r is a constant we can test in polynomial time whether N_0 represents
all the clusters in C (see proof of Lemma [3]).
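The "try them all" step is a plain enumeration over edges. As a rough sketch of why the number of options per insertion is polynomial, the hypothetical generator below lists the attachment choices in the three configurations of Figure 7 for an arbitrary list of edge identifiers; it only counts options and does not build networks, and treating each configuration as one choice per edge or unordered edge pair is an assumption made for illustration.

```python
from itertools import combinations

def attachment_choices(edges):
    """Yield (kind, edges) for each way of hanging a new reticulation."""
    for e, f in combinations(edges, 2):
        yield ("two distinct edges", (e, f))           # Figure 7(a)
    for e in edges:
        yield ("one edge subdivided twice", (e, e))    # Figure 7(b)
        yield ("one edge, multi-edge below", (e,))     # Figure 7(c)

edges = ["e1", "e2", "e3", "e4"]
choices = list(attachment_choices(edges))
# With E edges there are E(E-1)/2 + 2E options, i.e. O(E^2) per inserted ST-set.
print(len(choices))
```

Since a constant number r of insertions is performed, the total number of combinations tried stays polynomial in the number of edges.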

Just as in the similar algorithm described in [28] there are several slight technicalities that
should be noted. Whenever some S_i = ∅ we use a "dummy" taxon (i.e. some taxon not in X)
as the tree that we attach below a reticulation. The function of this is to ensure that subsequent
iterations can subdivide the edge leaving the reticulation, i.e. it is a placeholder. (This will be
necessary when the removal of a single SBR caused the disappearance of two or more reticulations.)
These dummy taxa can be removed just before N_0 is inspected to check whether it represents C.
Any N_0 that at this point still contains a dummy taxon whose parent is a reticulation should be
rejected, because it means at least one reticulation was not used. Secondly, when we construct the
tree N_r we actually add a "dummy root", which is simply a new node ρ' and a single edge
(ρ', ρ) where ρ is the root of N_r. This deals with the situation when the removal of some SBR
caused the current root to disappear and a new root to take its place. At the end, ρ' and the edge
leaving it should be removed. Finally, note that any multi-edges created by intermediate iterations
of the algorithm should all have disappeared (i.e. have been subdivided by reticulation edges) by
the time N_0 has been reached; for this reason we reject any N_0 that still contains multi-edges.

The network N we would like to reconstruct will eventually be found as some N_0, and given
that we made no assumptions about N this shows that the algorithm constructs all possible N.

□ 

Corollary 8. Let 𝒯 = {T_1, ..., T_m} be a set of binary trees on X. Then for a constant k > 0 it is
possible to construct in polynomial time all binary simple level-≤k networks that represent Cl(𝒯)
(if any exist).

Proof. For each network produced by the algorithm described in Theorem [2] we can easily check 
in polynomial time whether it is biconnected. 

□ 

It is worth noting at this stage an important link with the rooted triplet literature. Recall
the following proposition and lemma from [25], an article in which the relationship between trees,
clusters and triplets was discussed more broadly. Proposition [1] refers to not necessarily binary
trees.

Proposition 1. (Van Iersel, Kelk [25]) For any set 𝒯 of trees on the same set X of taxa, any
phylogenetic network on X representing Cl(𝒯) is consistent with Tr(𝒯).

Lemma 8. (Van Iersel, Kelk [25]) Let N be a phylogenetic network on X and 𝒯 a set of binary
trees on X. Then there exists a binary phylogenetic network N' on X such that (a) N' has the
same reticulation number and level as N, (b) if N displays all trees in 𝒯 then so too does N', (c)
if N is consistent with Tr(𝒯) then so too is N' and (d) if N represents Cl(𝒯) then so too does
N'.

Suppose that we have an algorithm A which, for each fixed r > 0, can construct in polynomial
time every binary network consistent with Tr(𝒯) that has at most r reticulations, where 𝒯 is a set
of binary trees on X. Suppose A' is an algorithm that examines in turn every network output by
A and rejects it if it does not represent Cl(𝒯) (such a filtering step can be done for each network
in polynomial time for fixed r, see proof of Lemma [3]). If there exists a network N on X with at
most r reticulations that represents Cl(𝒯) then by Lemma [8] there exists a binary network N'
on X with this property. Furthermore, in that case N' will by Proposition [1] be consistent with
Tr(𝒯). The algorithm A' is thus guaranteed to eventually find N'. The practical consequence of
this is that, simply by adding a filtering step, triplet software can in some cases easily be modified
to work for clusters. Indeed, it can be used to determine whether a network with r reticulations
that represents C exists and if so to construct all binary networks with this property. As proof of
concept we have taken the triplet software SIMPLISTIC (based on the ideas described in [28]),
removed its biconnectedness-checking subroutine so that it generates all binary networks with
up to r reticulations (and not just all simple level-≤r binary networks), and added the cluster
filtering step as described above. This whole process took only one day of programming, and led
to the new software package CLUSTISTIC which implements the result described in Theorem [2].



This software is available for download at http://skelk.sdf-eu.org/clustistic 



6 Witnesses and a natural lower bound 



Now, let 𝒯 = {T_1, ..., T_m} be a set of m not necessarily binary trees on X. Let S be a non-trivial
ST-set of Cl(𝒯). We know by Lemma [6] that for each T_i ∈ 𝒯 there is an edge e_i or node v_i in T_i
that represents S. We define a witness for S in T_i as follows. If T_i contains an edge e_i = (u_i, v_i)
that represents S, then a witness is any leaf descendant x_i ∈ X of u_i that does not appear in S,
i.e. x_i ∈ (X(u_i) \ S). Otherwise, from Lemma [6] there exists a node v_i in T_i that represents S, and
in that case a witness is any leaf descendant x_i ∈ (X(v_i) \ S). The only ST-sets with no witnesses
are X and the empty set, hence the restriction to non-trivial ST-sets. As an example consider the
four trees 𝒯 in Figure [8]. Consider the ST-set {1} of Cl(𝒯). In the top-left tree and bottom-left
tree the only possible witness for this is taxon 5. In the top-right tree the only witness is taxon 3,
and in the bottom-right tree the only witness is taxon 2.

Given a set of trees 𝒯 on X and a non-trivial ST-set S of X, let W ⊆ X be any subset of taxa
such that, for each tree T_i ∈ 𝒯, there exists x ∈ W that is a witness for S in T_i. We call such a
set a witness set of S in 𝒯. Clearly there exist W such that |W| ≤ m. For example, for the set of
trees in Figure [8], {2, 3, 5} is a possible witness set for {1}.
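Witnesses and witness sets can be computed directly from these definitions. The sketch below is illustrative: trees are assumed to be nested tuples with leaf labels at the tips, the function names are hypothetical, and the three example trees are invented for this illustration (they are not the trees of Figure 8). The smallest witness set is found by exhaustive search, which is adequate only for tiny examples.

```python
from itertools import combinations

def leaf_set(tree):
    """Leaf labels below a node (a leaf is any non-tuple)."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def witnesses(tree, s):
    """Possible witnesses for S in one tree, or None if no edge or node
    of the tree represents S."""
    s = frozenset(s)
    if not isinstance(tree, tuple):
        return None
    child_sets = [leaf_set(child) for child in tree]
    for child, below in zip(tree, child_sets):
        if below == s:            # the edge (tree, child) represents S
            return leaf_set(tree) - s
        if s < below:             # S lies strictly inside this child
            return witnesses(child, s)
    covered = [below for below in child_sets if below <= s]
    if covered and frozenset().union(*covered) == s:  # this node represents S
        return leaf_set(tree) - s
    return None

def smallest_witness_set(trees, s):
    """Smallest set of taxa containing a witness for S in every tree."""
    per_tree = [witnesses(t, s) for t in trees]
    taxa = sorted(set().union(*per_tree))
    for size in range(1, len(trees) + 1):
        for w in combinations(taxa, size):
            if all(set(w) & p for p in per_tree):
                return set(w)

trees = [(((1, 5), 2), (3, 4)), (((1, 3), 2), (4, 5)), (((1, 2), 3), (4, 5))]
print([sorted(witnesses(t, {1})) for t in trees])  # [[5], [3], [2]]
print(sorted(smallest_witness_set(trees, {1})))    # [2, 3, 5]
```

In this invented example the per-tree witnesses are pairwise disjoint, so no witness set with fewer than m = 3 elements exists.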

The two following simple observations are critical. 

Observation 4. Let 𝒯 be a set of trees on X, S a non-trivial ST-set of X and W a witness set
of S in 𝒯. Then for each C ∈ Cl(𝒯) such that S ⊂ C, W ∩ C ≠ ∅.

Proof. For each cluster C ∈ Cl(𝒯) there is at least one edge e = (u, v) in some T_i ∈ 𝒯 such that
e represents C. Let e_i (v_i) be the edge (node) in T_i that represents S. Given that S ⊂ C, the
leaf descendants of v must include all leaf descendants of the tail of e_i (respectively, of v_i), and
in particular all possible witnesses for S in T_i. □

To understand the meaning of Observation [4] it is helpful to again consider the example of
ST-set {1} in the context of Figure [8]. We see that any cluster that is a strict superset of {1} must
contain at least one of the taxa from {2, 3, 5}.

Observation 5. Let 𝒯 = {T_1, ..., T_m} be a set of trees on X such that m ≥ 2 and let S be a
non-trivial ST-set of Cl(𝒯). Let W be a smallest-cardinality witness set of S in 𝒯. If |W| = m then
for each C ∈ Cl(𝒯) such that S ∩ C = ∅, W \ C ≠ ∅.

Proof. Note that it is not possible for a witness for S in a tree T_i ∈ 𝒯 to also be a witness for
S in T_j ≠ T_i, because then |W| ≤ m - 1. So each witness in W comes from a different tree in
𝒯. Suppose then that there is some C ∈ Cl(𝒯) such that S ∩ C = ∅ and W \ C = ∅. Clearly,
since W ≠ ∅, W ⊆ C, and |C| ≥ |W| ≥ 2. Furthermore some edge e_j in some T_j represents C.
Combining the fact that S ∩ C = ∅ and W ⊆ C leads us to the conclusion that all elements of W
are possible witnesses for S in T_j. But then some element x of W is a witness for S both in T_i and
also in some T_j ≠ T_i, contradiction. □

As mentioned in the proof of Observation [5], a witness set W (for a given ST-set S) with m - 1
or fewer elements exists if and only if the possible witnesses for S ranging across the different T_i
are not all mutually disjoint. If a witness set with m - 1 or fewer elements does not exist then in the
following results any witness set with m elements will turn out to be sufficient, as a consequence
of Observation [5]. This will become clear in due course.



6.1 A natural lower bound on r(C) that is tight for clusters obtained from two
(not necessarily binary) trees

Lemma 9. Given a set of clusters C on X, there exists a maximal ST-set tree sequence (S_1, S_2, ..., S_p)
such that p ≤ r(C).

Proof. The proof is equivalent to that of Lemma [7] but for the fact that no empty ST-set is inserted
in the maximal ST-set tree sequence. □



We define the maximal ST-set lower bound for C (MST lower bound for short) as the length
of the shortest maximal ST-set tree sequence for C. By Lemma [9] this is a genuine lower bound
on r(C). In general it is however a rather weak lower bound: consider the set C^i of clusters on
X^i = {r, x_1, ..., x_i} defined by {{r, x_j} | 1 ≤ j ≤ i}. The MST lower bound for this cluster set is
always 1, while r(C^i) rises linearly in i. However, as we shall see, the tightness of the bound is to
some extent correlated with the number of trees which generate the clusters, with two trees being
a special case. We first require some definitions and auxiliary lemmas.
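The weak-bound example can be reproduced with an exhaustive search. The sketch below is a brute-force illustration with hypothetical function names: it enumerates all subsets to find maximal ST-sets and recurses over every choice, so it takes exponential time and is only meant for tiny inputs. Compatibility is assumed to mean disjoint or nested, and a set counts as separated when some cluster is incompatible with it.

```python
from itertools import combinations

def compatible(a, b):
    """Clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a

def is_st_set(clusters, s):
    """Unseparated by the clusters, and pairwise compatible after restriction to s."""
    restricted = {c & s for c in clusters}
    return (all(compatible(c, s) for c in clusters)
            and all(compatible(a, b) for a in restricted for b in restricted))

def maximal_st_sets(clusters, taxa):
    """Non-empty ST-sets that are not strictly contained in another ST-set."""
    subsets = [frozenset(c) for k in range(1, len(taxa) + 1)
               for c in combinations(sorted(taxa), k)]
    st_sets = [s for s in subsets if is_st_set(clusters, s)]
    return [s for s in st_sets if not any(s < t for t in st_sets)]

def mst_lower_bound(clusters, taxa):
    """Length of a shortest maximal ST-set tree sequence (exhaustive search)."""
    clusters = {frozenset(c) for c in clusters}
    taxa = frozenset(taxa)
    if all(compatible(a, b) for a in clusters for b in clusters):
        return 0
    return 1 + min(
        mst_lower_bound({c - s for c in clusters if c - s}, taxa - s)
        for s in maximal_st_sets(clusters, taxa)
    )

def example(i):
    """MST lower bound of the cluster set {{r, x_j} : 1 <= j <= i}."""
    taxa = ["r"] + [f"x{j}" for j in range(1, i + 1)]
    clusters = [{"r", f"x{j}"} for j in range(1, i + 1)]
    return mst_lower_bound(clusters, taxa)

print(example(3), example(4))  # 1 1: removing {r} leaves only singleton clusters
```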

Consider a network N on X. Let X' = {ρ} ∪ X where ρ ∉ X is some arbitrary symbol
representing the root of N. Let T be some tree on X* where X* ∩ X' = ∅ and let H be a subset
of X'. We can obtain a new network N' on (X* ∪ X) by hanging T from H in N. Informally N' is
obtained by hanging the tree T beneath a new reticulation which has |H| incoming edges, where
each such incoming edge begins "just above" an element of H. Formally the transformation is as
follows. First we add a new edge (r, r') to T, where r' is the root of T and r is a new node. For
each h ∈ H \ {ρ} we then subdivide the unique edge entering h; let h_p be the new parent (with
indegree and outdegree 1) of h. For each h ∈ H \ {ρ} we then add a new edge (h_p, r). Finally,
if ρ ∈ H we also add an edge from the root of N to r; we call this a root edge. It is clear that
r(N') = r(N) + |H| - 1.

Lemma 10. Let 𝒯 = {T_1, ..., T_m} be a set of m trees on X and let C = Cl(𝒯). Let S be a
maximal ST-set of C. Let N be any network on X \ S that represents C \ S. Then it is possible
to extend N to obtain a new network N' that represents C such that r(N') ≤ r(N) + (m - 1).

Proof. Let T_S be the unique tree on taxa set S such that Cl(T_S) = C|S. Let W be a minimum-
cardinality witness set for S in 𝒯. Clearly 1 ≤ |W| ≤ m. If |W| = m then we let N' be the network
obtained by hanging T_S from W in N. If |W| < m then we let N' be the network obtained by
hanging T_S from W ∪ {ρ} in N. Clearly, r(N') ≤ r(N) + (m - 1). It remains only to show that N'
represents C. Consider any cluster C ∈ C. There are three cases to consider. (1) If C ⊆ S then N'
clearly represents C because T_S already represented C|S. (2) If S ⊂ C then consider C' = C \ S.
Clearly N represents C'. By Observation [4] there exists some w ∈ W ∩ C. Since W ∩ S = ∅, this
implies that there exists some w ∈ W ∩ C'.
To see that N' represents C consider any tree displayed by N that represents C'. We can
extend this tree by "switching on" the new reticulation edge that begins above w, i.e. the edge
(w_p, r), and "switching off" the remaining reticulation edges. (3) If S ∩ C = ∅ then there are two
subcases. (3.1) If |W| < m then we can "switch on" the root edge that enters the reticulation
above T_S, i.e. the edge (ρ, r), and "switch off" all other reticulation edges entering T_S. (3.2) If
|W| = m then by Observation [5] there exists w ∈ W \ C. In N' we can thus "switch on" the new
reticulation edge that begins above w, and "switch off" the rest. □

Theorem 3. Let 𝒯 = {T_1, ..., T_m} be a set of m trees on X and let C = Cl(𝒯). Let p be the
MST lower bound for C. Then r(C) ≤ (m - 1)p.

Proof. Given a tree T and a node u of T, we denote by X(T) the label set of T and by T_u the
subtree rooted at u. From Lemma [9] we already know that p ≤ r(C). Now, let (S_1, S_2, ..., S_p) be
a maximal ST-set tree sequence for C. We will complete the proof by showing how to explicitly
construct a network N with reticulation number at most (m - 1)p that represents C. We define C_i,
1 ≤ i ≤ p, as C \ S_1 \ ... \ S_i and C_0 as C. By Lemma [6] it can be seen that for each i, C_i = Cl(𝒯_i)
where 𝒯_i is a set of at most m trees on X \ S_1 \ ... \ S_i and where 𝒯_0 = 𝒯. In particular, 𝒯_{i+1}
can be obtained from 𝒯_i as follows. Given a tree T_j in 𝒯_i, let (without loss of generality) u_j be the
node of T_j such that S_{i+1} is equal to the union of the clusters represented by some not necessarily
strict subset of its outgoing edges. Such a u_j exists by Lemma [6]. Let Q = {v_1, ..., v_k} be the set
of children of u_j such that for each v ∈ Q, X(T_v) contains at least one element of S_{i+1}. The set of
trees 𝒯_{i+1} can be obtained from 𝒯_i by computing, for each tree T_j in 𝒯_i, the tree T_j \ T_{v_1} \ ... \ T_{v_k},
i.e. pruning away the subtrees corresponding to S_{i+1} and tidying up the resulting tree.

Now, consider C_p. Let N_p be the unique tree such that Cl(N_p) = C_p; N_p will be equal to the
single tree in 𝒯_p. By Lemma 10 we can obtain a network N_{p-1} with at most (m - 1) reticulations that
represents C_{p-1} by taking 𝒯 = 𝒯_{p-1}, N = N_p and S = S_p in the proof of that lemma. We iterate
this process for p - 2, p - 3, ..., 1. This lasts at most p iterations in total, and each iteration adds
at most (m - 1) to the reticulation number, thus yielding a network N_0 that represents C with reticulation
number (at most) (m - 1)p. □

Corollary 9. Let 𝒯 = {T_1, T_2} be a set of two not necessarily binary trees on X and let C =
Cl(𝒯). Let p be the MST lower bound for C. Then r(C) = p.
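For completeness, the one-line derivation behind this corollary combines the lower bound of Lemma 9 with the upper bound of Theorem 3 for m = 2 (written here in LaTeX notation, with \mathcal{C} denoting the cluster set):

```latex
p \;\le\; r(\mathcal{C}) \;\le\; (m-1)\,p \;=\; (2-1)\,p \;=\; p
\quad\Longrightarrow\quad r(\mathcal{C}) = p .
```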

In [25] it is shown that it is NP-hard and APX-hard to compute r(C) where C is the set of 
clusters obtained from two binary trees on X. The following corollary is thus immediate. 

Corollary 10. The computation of the MST lower bound is NP-hard and APX-hard. 

It is interesting to note that Lemma [9] and Corollary [10] have, in some sense, already appeared
in the phylogenetic network literature, albeit in the language of recombination networks. Specifically,
in [25] we highlight that the phylogenetic network model described there (and also used
here) is in a strong sense identical to the recombination network model under the assumption of
an all-0 root, the infinite sites model and multiple crossover recombination. The computational
lower bound described in Algorithm 3 of [18] is, taking this equivalence into account, essentially
identical to the MST lower bound. In [33] it is shown that computing this bound is NP-hard, by
reduction from MAX-2-SAT. The same authors also give an exponential-time dynamic programming
algorithm for computing the bound because in [18] it was not explicitly indicated how this
should be computed.

7 The optimality and non-optimality of Cass 

The Cass algorithm for constructing simple level-$k$ networks was presented in [37]. The algorithm
was designed to produce solutions of minimum level, not of minimum reticulation number.
However, when the input is a separating set $\mathcal{C}$ of clusters on $X$, minimizing the level or the
reticulation number is equivalent. Indeed, such cluster sets have the property that any network that
represents them is simple or can easily be made simple (see Lemma 1), and a simple network contains
exactly one non-trivial biconnected component.

The Cass algorithm can be used as a subroutine in a divide and conquer algorithm to construct
general level-$k$ networks. We will call this more general algorithm Cass^DC. The basic idea of
Cass^DC is that it transforms each connected component of $IG(\mathcal{C})$ into a tangled set of clusters,
runs Cass separately on each of these tangled sets, and combines the resulting simple networks into
a single final network $N$. (Recall that tangled cluster sets are separating.) The final network $N$ has
reticulation number equal to the sum of the reticulation numbers of the simple networks produced
by Cass, and $N$ has level equal to the maximum level ranging over all the simple networks. For
more details on the divide and conquer strategy, see [37] and [12], Section 8.2.
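
The decomposition step can be illustrated with a minimal Python sketch. It is not the Cass^DC
implementation: it only groups clusters into connected components of the incompatibility graph
and combines per-component scores as described above. The function names
(`incompatibility_components`, `combine`) are hypothetical, and clusters are assumed to be given
as Python sets of taxon labels.

```python
from itertools import combinations


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def incompatibility_components(clusters):
    """Group clusters into connected components of the incompatibility graph IG(C)."""
    clusters = [frozenset(c) for c in clusters]
    adjacent = {c: set() for c in clusters}
    for a, b in combinations(clusters, 2):
        if not compatible(a, b):
            adjacent[a].add(b)
            adjacent[b].add(a)
    seen, components = set(), []
    for start in clusters:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # depth-first search from an unvisited cluster
            current = stack.pop()
            component.add(current)
            for neighbour in adjacent[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def combine(scores):
    """Combine (reticulation number, level) pairs of the simple networks:
    reticulation numbers add up, the level is the maximum."""
    total_r = sum(r for r, _ in scores)
    max_level = max((level for _, level in scores), default=0)
    return total_r, max_level


if __name__ == "__main__":
    example = [{1, 2}, {2, 3}, {4, 5}, {1, 2, 3}]
    for component in incompatibility_components(example):
        print(sorted(sorted(c) for c in component))
```

Components consisting of a single cluster contain no conflict; only components with at least two
clusters would be passed on to Cass in this sketch.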

In [57] the authors proved that if there is a level-$\le 2$ network that represents $\mathcal{C}$, then Cass^DC
will find such a solution with minimal level. Here we clarify several other properties of the
algorithm. On the negative side we show (using a special separating cluster set) that the Cass
algorithm does not in general minimize level. On the positive side we show that when the
input set $\mathcal{C}$ is equal to $Cl(\{T_1, T_2\})$ for any two (not necessarily binary) trees, Cass^DC correctly
minimizes level. In fact we show something even stronger: in this case Cass^DC also correctly
constructs networks with minimum reticulation number, which in turn is exactly equal to the
hybridization number of the two input trees, i.e. the number of reticulations required to display
the trees themselves. We conclude with several open questions regarding Cass.

7.1 Cass: the high-level idea 

Let $N$ be a network that represents a set of clusters $\mathcal{C}$ on $X$. Let $S$ be a non-trivial ST-set with
respect to $\mathcal{C}$. We say that $S$ is under a cut-edge if $N$ contains a cut-edge $(u, v)$ such that the
subnetwork rooted at $v$ is a tree that represents $\mathcal{C}|S$.



The next two results formed (implicitly) the direct inspiration for Cass, which was designed 
to be a generalization of these results. 

Lemma 11. Let $N$ be a network that represents a set of clusters $\mathcal{C}$ on $X$. Let $S$ be a non-trivial
ST-set with respect to $\mathcal{C}$. Then there exists a network $N'$ such that $r(N') \le r(N)$, $\ell(N') \le \ell(N)$,
$S$ is under a cut-edge in $N'$ and for each ST-set $S'$ such that $S' \cap S = \emptyset$ and $S'$ is under a cut-edge
in $N$, $S'$ is also under a cut-edge in $N'$.

Proof. Deferred to the appendix. □ 

The following corollary follows from the fact that maximal ST-sets are disjoint: 

Corollary 11. Let $N$ be a network that represents a set of clusters $\mathcal{C}$. There exists a network
$N'$ such that $r(N') \le r(N)$, $\ell(N') \le \ell(N)$ and all maximal ST-sets (with respect to $\mathcal{C}$) are below
cut-edges.

The pseudocode for Cass was originally given in [27]. That exposition is however rather dense
and technical. See Section 8.5 of [12] for a clearer detailed description. Here we only give the 
core idea of the algorithm. Let us assume without loss of generality that we want to know, given 
a separating set of clusters, whether a simple network solution exists with reticulation number 
exactly k, for some constant k. 

Cass tries to answer this by searching through the space of all maximal ST-set tree sequences 
of length at most k, attempting to build a network with reticulation number k from each one. 
It looks first at shorter maximal ST-set tree sequences, padding those of length less than k with 
empty ST-sets to attain a sequence of length exactly k. (As in the proof of Theorem 2, this models
the situation when removing a single SBR causes the reticulation number to drop by more than 1.)
If there are no maximal ST-set tree sequences of length at most k then Cass will correctly report 
that no solutions with reticulation number k or lower exist. Hence Cass implicitly computes and 
incorporates the MST lower bound. 

Assuming maximal ST-set tree sequences of length at most $k$ do exist, Cass examines each
one to determine whether it can be constructively turned into a real solution. Let
$\mathcal{S} = (S_1, \ldots, S_k)$ be a (possibly padded) maximal ST-set tree sequence of length $k$ for $\mathcal{C}$. As in
earlier sections we define $\mathcal{C}_i$, $1 \le i \le k$, as $\mathcal{C} \setminus S_1 \setminus \ldots \setminus S_i$, and we let
$\mathcal{C}_0 = \mathcal{C}$. Cass does not however work with the set $\mathcal{C}_i$. Instead it works with $\mathcal{C}'_i$, which is
obtained from each $\mathcal{C}_i$ by "collapsing" every maximal ST-set $S$ with respect to $\mathcal{C}_i$ into a single
new "meta-taxon". This is an extremely greedy step, and is in some sense an attempt to generalize
Corollary 11. The informal motivation is this: we know from Corollary 11 that there exists some
network $N$ with a minimum number of reticulations such that all maximal ST-sets are under
cut-edges. Suppose we guess an SBR $T$ on $X' \subset X$ (where $X'$ is a maximal ST-set) of $N$; we only have
to make polynomially-many guesses because there are only polynomially-many maximal ST-sets. This
gives a new network $N'$ on $X \setminus X'$ where perhaps not all maximal ST-sets (with respect to
$\mathcal{C} \setminus X'$) are under cut-edges. In particular, some SBRs of $N'$ might correspond to non-maximal
ST-sets, of which there are potentially exponentially many, so how do we efficiently guess an SBR
of $N'$? Fortunately, we can transform $N'$ (in the sense of Corollary 11) to obtain a new network $N''$
such that $r(N'') \le r(N')$ and where all maximal ST-sets (with respect to $\mathcal{C} \setminus X'$) are under
cut-edges of $N''$. Hence we know that we again only have to make polynomially-many guesses to
locate an SBR of $N''$. Furthermore, Cass assumes that these maximal ST-sets will always remain
below cut-edges, so it collapses them into the aforementioned meta-taxa. We iterate this entire
process $k$ times, concluding the "inward" phase of Cass.

This assumption is important because it affects the "outward" phase of Cass, which begins
immediately after completion of the inward phase. As in, for example, Theorem 2, the general idea
is to start with a tree that represents the cluster set remaining after the inward phase and then
to work backwards, first trying all pairs of edges from which to "hang back" an SBR corresponding
to $S_k$, then all pairs of edges (of the resulting network) from which to hang back an SBR
corresponding to $S_{k-1}$, and so on, down to $S_1$. However, before hanging back each $S_i$ it first
decollapses the maximal ST-sets that were collapsed into meta-taxa during the corresponding
iteration of the inward phase.

In Section A.2 of the appendix we explicitly and exhaustively walk through a specific
execution of the Cass algorithm, which helps to clarify exactly how the algorithm works.



7.2 The Cass algorithm is not always optimal 

In this section we present a counter-example proving that the Cass algorithm does not always 
minimize level. 







Fig. 8. Let $\mathcal{T}$ be the set of four trees shown here. The Cass algorithm returns a network $N$
that represents $Cl(\mathcal{T})$ where $r(N) = \ell(N) = 4$. However, Figure 9 shows that the true value of
$r(Cl(\mathcal{T})) = \ell(Cl(\mathcal{T}))$ is at most 3.





Fig. 9. (a) A simple level-3 network and (b) a simple level-4 network, both representing $Cl(\mathcal{T})$
where $\mathcal{T}$ is defined as described in Figure 8. The level-4 network was produced by Cass^DC.



Consider the set $\mathcal{T}$ of four binary trees shown in Figure 8. It is easy to verify that the set
$\mathcal{C} = Cl(\mathcal{T})$ is separating and every network that represents $\mathcal{C}$ is thus simple or can easily be
made simple. It is also easy to verify that the simple level-3 network in Figure 9(a) represents $\mathcal{C}$;
in the appendix we prove that this is optimal by showing that any network that represents $\mathcal{C}$ must
have reticulation number 3 or higher.



However, Cass cannot find a level-3 network. To formally prove this we show in Section A.2
of the appendix an exhaustive list of all possible executions of the algorithm with $k = 3$. Cass
returns the simple level-4 network shown in Figure 9(b). To summarize, the problem with Cass
seems to be that while the step of always collapsing at every iteration all maximal ST-sets and
treating them as meta-taxa (in the sense of Corollary 11) is a locally optimal move, it can force us
to use too many reticulation edges when hanging (trees corresponding to) maximal ST-sets back
in the outward phase.

Note that this counter-example fits in a tradition of highly specific and complex
counter-examples in the phylogenetic network literature. In particular we note the very similar
counter-examples given initially in [8], and (based on this) in [11], which showed that one cannot
minimize reticulation number by optimizing independently over the connected components of the
incompatibility graph $IG(\mathcal{C})$. The relationship between these counter-examples - all of which are
linked in some way or other to simple level-3 networks - seems to be that networks with minimum
reticulation number have a very subtle internal structure that seems impervious to locally optimal
and greedy strategies, but that these properties only start emerging for level-3 and higher.

It is important to emphasize, however, that since we could not find any non-synthetic dataset 
for which Cass does not find an optimal solution, we still have the feeling that Cass works quite 
well for real data. 



7.3 Cass is optimal for sets of clusters obtained from two trees 

Here we show that, despite the negative news in the previous section, Cass (and more generally
Cass^DC) correctly minimizes level in the case of clusters obtained from two not necessarily binary
trees. Furthermore the algorithms also provably minimize reticulation number.

Note that the following theorem does not contradict the NP-hardness mentioned in Corollary 10
because Cass only runs in polynomial time when it bounds its search to simple level-$\le k$
networks, for a constant $k$.

Theorem 4. Let $\mathcal{C} = Cl(\mathcal{T})$ be a separating set of clusters where $\mathcal{T}$ is a set of two not
necessarily binary trees on $X$. Then Cass constructs a simple network $N$ that represents $\mathcal{C}$ such
that $\ell(N) = r(N) = \ell(\mathcal{C}) = r(\mathcal{C})$.

Proof. Let the MST lower bound for $\mathcal{C}$ be $p$. Recall that, by Corollary 9, in this case $p = r(\mathcal{C})$.
We know that there is a maximal ST-set tree sequence $(S_1, \ldots, S_p)$. As explained in Section 7.1,
Cass will eventually find this maximal ST-set tree sequence. Now, Theorem 3 essentially works by
invoking Lemma 10 $p$ times, and in the statement of Lemma 10 there are absolutely no assumptions
made about the structure of "$N$", other than that it represents a certain set of clusters. Hence "$N$"
can just as well be one of the intermediate networks constructed by Cass. It remains only to show
that Cass can simulate the hanging-back construction described in the proof of Lemma 10. This is
definitely so, because the proof of Lemma 10 requires the edges entering two witnesses (or the edge
entering one witness and, to simulate the attachment of a root edge, the edge connecting a dummy
root to the real root) to be subdivided. Cass tries subdividing all pairs of edges, including the
edge between the dummy root and the real root, and hence will eventually subdivide the correct
two edges. □

The proof of Theorem 4 not only shows that Cass is optimal for sets of clusters obtained from
two not necessarily binary trees, but also that in this very special case the "hanging back" (i.e. 
outward) phase of Cass is in some sense completely redundant. In particular: if we have already 
computed a maximal ST-set tree sequence, then we can easily compute a witness set for each of the 
maximal ST-sets in the sequence, and these witness sets directly specify a sufficient set of edges 
to subdivide when hanging back the maximal ST-sets. So in the case of clusters coming from two 
trees Cass wastes rather a lot of time trying to hang back maximal ST-sets from all possible pairs 
of edges, when in fact the information is already available to make this blind search unnecessary. 

Theorem 4 can actually be re-formulated and extended to general (i.e. not necessarily separating)
sets of clusters $\mathcal{C}$ obtained from two not necessarily binary trees. Indeed, in Theorem 6 we will
prove that, for such cluster sets, Cass^DC reconstructs a network $N$ such that $r(N) = r(\mathcal{C})$ and
$\ell(N) = \ell(\mathcal{C})$. However, we first need to prove Theorem 5 below, which is interesting in its own
right, and several auxiliary results.

Observation 6. Let $\mathcal{C}$ be a set of clusters on $X$ and let $U \subseteq X$ be an unseparated set with respect
to $\mathcal{C}$. Then for any $P \subseteq X$, $U \setminus P$ is unseparated with respect to $\mathcal{C}|(X \setminus P)$.

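Proof. $U$ is unseparated with respect to $\mathcal{C}$ so for each cluster $C \in \mathcal{C}$ we have either that
$C \cap U = \emptyset$, $C \subseteq U$ or $U \subseteq C$. For the case $C \cap U = \emptyset$ it is clear that
$(C \setminus P) \cap (U \setminus P) = \emptyset$. For the case $C \subseteq U$ we have that $(C \setminus P) \subseteq (U \setminus P)$, and for
the case $U \subseteq C$ we have that $(U \setminus P) \subseteq (C \setminus P)$. □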
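
Observation 6 can be checked by brute force on small inputs. The following Python sketch is
only an illustration under stated assumptions: clusters are Python sets, "unseparated" is
taken exactly as in the proof above (every cluster is disjoint from, contained in, or a superset
of $U$), and the helper names `is_unseparated`, `remove_taxa` and `check_observation_6` are
hypothetical. The check enumerates all subsets $P$ and is therefore exponential in $|X|$.

```python
from itertools import combinations


def is_unseparated(u, clusters):
    """U is unseparated if each cluster is disjoint from U, inside U, or contains U."""
    u = frozenset(u)
    return all(
        c.isdisjoint(u) or c <= u or u <= c
        for c in (frozenset(c) for c in clusters)
    )


def remove_taxa(clusters, p):
    """Restrict every cluster to X minus P, dropping clusters that become empty."""
    p = frozenset(p)
    return {frozenset(c) - p for c in clusters} - {frozenset()}


def check_observation_6(taxa, clusters, u):
    """Return True if U minus P stays unseparated for every subset P of the taxa."""
    if not is_unseparated(u, clusters):
        raise ValueError("U must be unseparated with respect to the input clusters")
    taxa = sorted(taxa)
    for size in range(len(taxa) + 1):
        for p in combinations(taxa, size):
            reduced_u = frozenset(u) - frozenset(p)
            if not is_unseparated(reduced_u, remove_taxa(clusters, p)):
                return False
    return True


if __name__ == "__main__":
    X = {"a", "b", "c", "d", "e"}
    C = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "b", "c", "d"}]
    print(check_observation_6(X, C, {"a", "b", "c"}))
```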



Observation 7. Let C be a set of clusters on X and let U be the union of the set of clusters in a 
connected component K of IG(C). Then U is unseparated. 

Proof. Suppose $U$ is not unseparated. Then there must exist some $C \in \mathcal{C}$ such that
$C \not\subseteq U$, $U \not\subseteq C$ and $C \cap U \neq \emptyset$. Clearly $C$ cannot be incompatible with any cluster in the
connected component $K$, because then $C$ would also be in the connected component $K$ and thus
$C \subseteq U$. Hence every cluster in the connected component $K$ is either disjoint from $C$, or a subset of
it, and there is at least one of each type of cluster because $U \setminus C \neq \emptyset$ and $U$ is the union of all
the clusters in $K$. However, clusters in $K$ that are disjoint from $C$ are always compatible with
clusters in the connected component that are contained inside $C$, so the connected component is not
connected, contradiction. □

Observation 8. Given a set of clusters $\mathcal{C}$ on $X$, let $S$ be an ST-set with respect to $\mathcal{C}$ and let $U$
be an unseparated set such that $U \subseteq S$. Then $U$ is also an ST-set for $\mathcal{C}$.

Proof. We only need to show that all pairs of clusters in $\mathcal{C}|U$ are compatible. Clearly, for each
$C \in \mathcal{C}$ we have that $C|U \subseteq U$. Now, recall that, because $U$ is unseparated, for each $C \in \mathcal{C}$ we have
either $C \subseteq U$, $U \subseteq C$ or $C \cap U = \emptyset$. Suppose by contradiction that for some $C_1 \neq C_2 \in \mathcal{C}|U$,
$C_1$ and $C_2$ are incompatible. But then $C_1, C_2 \subset U$. But in that case $C_1, C_2 \in \mathcal{C}$ and
$C_1, C_2 \subset S$, contradicting the fact that $S$ was an ST-set. □
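
As a small illustration of Observation 8, the Python sketch below tests the ST-set property
directly. It assumes the definition used in the proof above: a set $S$ is an ST-set if it is
unseparated and all pairs of clusters in the restriction $\mathcal{C}|S$ are compatible. The function
names are hypothetical and the example cluster set is invented for illustration.

```python
from itertools import combinations


def compatible(a, b):
    """Clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def is_unseparated(u, clusters):
    """U is unseparated if it is compatible with every cluster."""
    u = frozenset(u)
    return all(compatible(u, frozenset(c)) for c in clusters)


def restrict(clusters, s):
    """Compute C|S: intersect every cluster with S, dropping empty results."""
    s = frozenset(s)
    return {frozenset(c) & s for c in clusters} - {frozenset()}


def is_st_set(s, clusters):
    """S is an ST-set if it is unseparated and C|S is pairwise compatible."""
    if not is_unseparated(s, clusters):
        return False
    return all(compatible(a, b) for a, b in combinations(restrict(clusters, s), 2))


if __name__ == "__main__":
    C = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"d", "e"}]
    S = {"a", "b", "c"}   # an ST-set of C
    U = {"a", "b"}        # unseparated subset of S
    print(is_st_set(S, C), is_unseparated(U, C), is_st_set(U, C))
```

In the example, $U$ is an unseparated subset of the ST-set $S$, and the check confirms that $U$ is
itself an ST-set, as the observation states.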

For a set of clusters $\mathcal{C}$ and an unseparated set $U$ with respect to $\mathcal{C}$, $\mathcal{C}_{U \to u}$ denotes the new
cluster set obtained by replacing all elements of $U$ with a single new taxon $u$ (i.e. "collapsing" $U$
into a single taxon).
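
The collapsing operation can be sketched in a few lines of Python. This is an illustration
only: the function name `collapse` is hypothetical, clusters are assumed to be Python sets, and
clusters lying entirely inside $U$ are assumed to become the singleton cluster of the new taxon,
which is what a literal reading of "replacing all elements of $U$" gives.

```python
def collapse(clusters, u, new_taxon):
    """Replace all elements of the unseparated set U by a single new taxon."""
    u = frozenset(u)
    collapsed = set()
    for c in (frozenset(c) for c in clusters):
        if c.isdisjoint(u):
            # Cluster does not touch U: unchanged.
            collapsed.add(c)
        elif c <= u or u <= c:
            # Cluster is nested with U: its U-part is replaced by the new taxon.
            collapsed.add((c - u) | {new_taxon})
        else:
            raise ValueError("U is separated by cluster %s" % sorted(c))
    return collapsed


if __name__ == "__main__":
    C = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"d", "e"}]
    for cluster in sorted(collapse(C, {"a", "b"}, "u"), key=sorted):
        print(sorted(cluster))
```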



Fig. 10. The three subcases covered by Case 1 (Case 1.1, Case 1.2 and Case 1.3), which concerns
the case when the ST-set $S_i$ that is being hung back is disjoint from $U$. The parts of the network
containing elements of $U$ are depicted in grey.



Theorem 5. Let $\mathcal{T} = \{T_1, T_2\}$ be two not necessarily binary trees on $X$, and let $\mathcal{C} = Cl(\mathcal{T})$. Let
$U \subseteq X$ be an unseparated set with respect to $\mathcal{C}$. Then $r(\mathcal{C}) = r(\mathcal{C}|U) + r(\mathcal{C}_{U \to u})$.

Proof. Let $p = r(\mathcal{C})$. We know by Corollary 9 that there exists a maximal ST-set tree sequence
$(S_1, \ldots, S_p)$. As usual we define $\mathcal{C}_i$, $1 \le i \le p$, as $\mathcal{C} \setminus S_1 \setminus \ldots \setminus S_i$, and we let
$\mathcal{C}_0 = \mathcal{C}$. We let $X_i = \bigcup_{C \in \mathcal{C}_i} C$ where $X_0 = X$. (Note that $X_i = X \setminus S_1 \setminus \ldots \setminus S_i$.)
Since $U \setminus S_1 \setminus \ldots \setminus S_i = X_i \cap U$, by repeated application of Observation 6, $X_i \cap U$ is
unseparated with respect to $\mathcal{C}_i$ for $0 \le i \le p$. Now, recall (see the proof of Lemma 10) that for each
$S_i$ we can identify a set of two witnesses (where perhaps one of the witnesses is the symbol $\rho$
representing the root).

Here we show, for each $S_i$, how to hang back a tree representing $\mathcal{C}_{i-1}|S_i$ from a network $N$
representing $\mathcal{C}_i$ to obtain a network $N'$ representing $\mathcal{C}_{i-1}$ where $r(N') = r(N) + 1$ and in $N'$ the
taxa $X_{i-1} \cap U$ are exactly the set of taxa below some cut-edge. The witnesses of $S_i$ guide us how
to do this. If we repeat this $p$ times we will obtain a network with $p$ reticulations (and reticulation
number $p$) that represents $\mathcal{C}$ and such that the taxa in $U$ are exactly the subset of taxa below
some cut-edge. The theorem will then follow.

The first thing to do is to study the earliest point at which some elements of $U$ are added back
into the network. This is an important "base case". Let us thus consider the largest value of $i$ such
that $X_i \cap U \neq \emptyset$. Let $i'$ be equal to this value. Now, suppose $i' = p$. We saw that by repeated
application of Observation 6, $X_p \cap U$ is unseparated with respect to $\mathcal{C}_p$. Furthermore we know that
the clusters $\mathcal{C}_p$ can be represented by a tree, so $X_p \cap U$ is actually an ST-set. Hence we can assume
without loss of generality (by Lemma 11) that the tree that represents $\mathcal{C}_p$ has a cut-edge such
that $X_p \cap U$ is exactly the set of taxa beneath it. Alternatively, suppose $i' < p$. In this case the
first elements of $U$ that are reintroduced into the network are a (not necessarily strict) subset of
$S_{i'+1}$, so we have $X_{i'} \cap U = S_{i'+1} \cap U$. Given that $X_{i'} \cap U$ is unseparated w.r.t. $\mathcal{C}_{i'}$, and
$S_{i'+1} \cap U$ is a subset of an ST-set, it follows by Observation 8 that $S_{i'+1} \cap U$ is also an ST-set.
We may thus assume without loss of generality that $X_{i'} \cap U$ is exactly the set of taxa below a
cut-edge (again thanks to Lemma 11).

Henceforth we may assume that the network $N$ that we want to hang (a tree corresponding
to) $\mathcal{C}_{i-1}|S_i$ back from contains at least one taxon of $U$. We will make heavy use of this fact. Let
$e$ be the cut-edge of $N$ which the elements of $X_i \cap U$ are below. Let $w_1, w_2$ be the two witnesses
for $S_i$, where in some cases $w_2 = \rho$ (representing the root).



There are several cases to consider. The first case is when the tree that we are hanging back
is disjoint from $U$. See also Figure 10.

Case 1) $S_i \cap U = \emptyset$

Subcase 1.1) $\{w_1, w_2\} \cap U = \emptyset$. In this case we can simply hang back from $\{w_1, w_2\}$ because
we are not adding any new elements of $U$ and we are not subdividing any edge reachable from $e$.
Hence $e$ remains the cut-edge which all present elements of $U$ are below.

Subcase 1.2) $\{w_1, w_2\} \subseteq U$. If we simply hang $S_i$ back from $\{w_1, w_2\}$ then we obtain a
network that represents $\mathcal{C}_{i-1}$. However, we see from this that every cluster $C$ in $\mathcal{C}_{i-1}$ that is a
strict superset of $S_i$ must contain at least one element of $U$. $S_i$ is disjoint from $U$ so by
unseparation $C$ must also contain all other elements of $U$ in the network. Hence we do not actually
need to put $S_i$ below a reticulation at all: we can simply attach it to a single new cut-edge that
subdivides $e$. (This case cannot actually occur because it saves a reticulation and hence implies
that $r(\mathcal{C}) < p$.)

Subcase 1.3) (wlog) $w_1 \in U$, $w_2 \notin U$. Suppose we simply hang back from $\{w_1, w_2\}$, obtaining
a network $N'$ that represents $\mathcal{C}_{i-1}$. Note that after doing this any cluster that passes through
the reticulation edge starting just above $w_1$ must (because of the unseparation of $U$) contain all
elements of $U$ that are in the network. So we can subdivide the cut-edge $e$ and move the tail of
that reticulation edge to the newly created vertex.

The next case is when all elements of $S_i$ are in $U$, see also Figure 11.

Case 2) $S_i \subseteq U$

Subcase 2.1) $\{w_1, w_2\} \cap U = \emptyset$. In this case we don't need a reticulation at all: we can just
hang $S_i$ from a single new cut-edge that subdivides $e$. (This case cannot actually happen because
it saves a reticulation: see Case 1.2.)

Subcase 2.2) $\{w_1, w_2\} \subseteq U$. This case is fine because if we hang back from $w_1$ and $w_2$ all
the elements of $U$ remain below the cut-edge $e$.



Subcase 2.3) (wlog) $w_1 \in U$, $w_2 \notin U$. Suppose we hang back from $w_1$ and $w_2$ to obtain a
network that represents $\mathcal{C}_{i-1}$. In this case we could move the reticulation edge that starts just
above $w_2$ (or at the root, in the case that $w_2 = \rho$) to subdivide the cut-edge $e$.

Fig. 11. The three subcases covered by Case 2 (Case 2.1, Case 2.2 and Case 2.3), which concerns
the case when the ST-set $S_i$ that is being hung back is a subset of $U$. The parts of the network
containing elements of $U$ are depicted in grey.



Fig. 12. Case 3, which concerns the case when the ST-set $S_i$ that is being hung back contains
elements of $U$ and elements not in $U$. The parts of the network containing elements of $U$ are
depicted in grey. Note that in this case we do not care where the reticulation edges connect to the
rest of the network. Here $R = S_i \setminus U$.



The final case is where $S_i$ contains at least one element of $U$ and at least one element not in
$U$. See also Figure 12.

Case 3) $S_i \cap U \neq \emptyset$ and $S_i \setminus U \neq \emptyset$

In this case we will apply a transformation which brings us back into Case 2. Let $R = S_i \setminus U$.
Suppose we hang $S_i$ back from its two witnesses $w_1$ and $w_2$, to obtain a network $N'$ that represents
$\mathcal{C}_{i-1}$. Consider any cluster $C \in \mathcal{C}_{i-1}$ such that $C \cap R \neq \emptyset$. Then either $C \subseteq R$ or
$X_{i-1} \cap U \subseteq C$. (The second condition holds because, if $C \not\subseteq R$, then it must contain some element
in $S_i \cap U$, and hence all elements of $U$ in the network.) So a cluster $C$ that is a strict subset of
$S_i$ is either a subset of $R$ or disjoint from $R$. Indeed, if $C$ contains one element of $R$ and one of the
unseparated set $U$, $C$ must contain all elements of $U$, a contradiction since $C$ is a strict subset of
$S_i$. It follows that $R$ is unseparated so by Observation 8 $R$ is an ST-set. Hence we may assume
(without loss of generality) that $R$ is the taxa set of some subtree $T'$ below a cut-edge in the tree
representing $\mathcal{C}_{i-1}|S_i$ that we hung back. Now, we can prune $T'$ and regraft it back onto the network
at a new vertex obtained by subdividing $e$. It can be verified that after this prune/regraft move the
resulting network still represents $\mathcal{C}_{i-1}$. It remains only to apply Case 2, taking $w_1, w_2$ as the
witnesses and $S_i \setminus R$ as the ST-set that we want to hang back. □

Theorem 6. Let $\mathcal{T} = \{T_1, T_2\}$ be two not necessarily binary trees on $X$, and let $\mathcal{C} = Cl(\mathcal{T})$.
When given $\mathcal{C}$ as input, Cass^DC computes a level-$\ell(\mathcal{C})$ network $N$ with reticulation number $r(\mathcal{C})$
that represents $\mathcal{C}$.

Proof. Observation 7 and Theorem 5 ensure that we can analyze each connected component of
the incompatibility graph $IG(\mathcal{C})$ separately, which (as mentioned) is exactly what Cass^DC does.
(To see this it is helpful to note that any subset of $\mathcal{C}$ can also, with the possible exception of some
superfluous singleton clusters, be expressed as the set of clusters in two trees.)

Let $K$ be a connected component and denote by $\mathcal{C}_K$ the set of clusters in $K$. Let $X_K$ be the
set of taxa equal to the union of all clusters in $\mathcal{C}_K$. Note that $\mathcal{C}_K$ is not necessarily a tangled set.
Indeed, while $IG(\mathcal{C}_K)$ is connected, the second property of a tangled set (every pair of taxa $x$ and
$y$ in $X_K$ is separated by $\mathcal{C}_K$, Section 2) does not always hold. To ensure that the latter condition
holds Cass^DC simply computes all the maximal ST-sets $\{S_1, \ldots, S_k\}$ for $\mathcal{C}_K$ and for each of them
replaces all elements of $S_i$ with a single new taxon $s_i$ in $\mathcal{C}_K$, to obtain a new cluster set $\mathcal{C}'_K$ that
is tangled.

To see that the constructed network has minimum level, observe that (by Theorem 4) Cass
correctly computes minimum level solutions for the $\mathcal{C}'_K$ cluster sets mentioned above. In [27] it is
proven that combining minimum-level solutions for the various $\mathcal{C}'_K$ yields a minimum-level solution
for $\mathcal{C}$.

We now need to prove that the constructed network has reticulation number $r(\mathcal{C})$. Since
maximal ST-sets are unseparated and all mutually disjoint (see Corollary 4), it follows from Theorem
5 that:

$$r(\mathcal{C}_K) = \sum_{S \text{ is a maximal ST-set of } \mathcal{C}_K} r(\mathcal{C}_K|S) + r(\mathcal{C}'_K).$$

Since $r(\mathcal{C}_K|S)$ always equals zero when $S$ is an ST-set, $r(\mathcal{C}_K) = r(\mathcal{C}'_K)$. Moreover, since the sets
$X_K$ are also unseparated (from Observation 7) and all mutually disjoint, then from Theorem 5 we
have that:

$$r(\mathcal{C}) = \sum_{K \text{ is a connected component of } IG(\mathcal{C})} r(\mathcal{C}|X_K) + r(\mathcal{C}'),$$

where $\mathcal{C}'$ is obtained from $\mathcal{C}$ by replacing all elements of each $X_K$ with a single new taxon $x_K$.
Obviously, $r(\mathcal{C}') = 0$. Since $r(\mathcal{C}|X_K) = r(\mathcal{C}_K) = r(\mathcal{C}'_K)$, and $r(\mathcal{C}'_K) = \ell(\mathcal{C}'_K)$ because $\mathcal{C}'_K$ is
separating, this concludes the proof that the constructed network has reticulation number $r(\mathcal{C})$.

□ 

7.4 Cass can be used to compute the hybridization number of two not necessarily 
binary trees 

It is important to understand the relationship between the results presented in the previous section 
and the extensive literature on computing the hybridization distance of two trees |3lll2l4lf7] i.e. 
the problem of displaying the trees themselves, and not just their clusters. 

For two binary trees $\mathcal{T} = \{T_1, T_2\}$ on $X$ the hybridization distance $h(\mathcal{T})$ is defined as the
minimum reticulation number $r_t(\mathcal{T})$ of any network that displays both the trees in $\mathcal{T}$. In [25] we
showed that $r_t(\mathcal{T}) = r(Cl(\mathcal{T}))$. As discussed in [25], that means that both positive and negative
results for the computation of $r_t(\mathcal{T})$ transfer automatically to computation of $r(Cl(\mathcal{T}))$ (for two
binary trees). Negative results are NP-hardness and APX-hardness; positive results include fixed
parameter tractability and running time improvements based on an increasingly deep understanding of
maximum acyclic agreement forests.

Before proceeding there are some technical issues regarding the definition of hybridization 
number of a set T of two not necessarily binary trees on X . This can be attributed to the fact 
that several different definitions have appeared in the literature: 

1. $h^+(\mathcal{T})$: the minimum reticulation number of $N$ ranging over all networks $N$ that display in
a strict topological sense all the trees in $\mathcal{T}$. This is exactly the definition of display given in
Section 2. In Figure 8 of [25] this strict definition was used.

2. $h^\circ(\mathcal{T})$: the minimum reticulation number of $N$ ranging over all networks $N$ that display some
not necessarily binary refinement of each tree in $\mathcal{T}$. Recall that a (binary) refinement of a tree
$T$ on $X$ is any (binary) tree $T'$ on $X$ such that $Cl(T) \subseteq Cl(T')$. (Note that a tree is generically
considered to be a refinement of itself.)

3. $h^-(\mathcal{T})$: the minimum reticulation number of $N$ ranging over all networks $N$ that display a
binary refinement of each tree in $\mathcal{T}$. This was the definition used in [17].
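
The refinement relation used in definitions 2 and 3 reduces to a cluster-containment test,
which the following Python sketch illustrates. It is a sketch under stated assumptions: rooted
trees are represented as nested tuples of leaf labels (an encoding chosen here for
illustration), $Cl(T)$ is taken to include the singleton clusters and the full taxon set, and
the function names are hypothetical.

```python
def clusters_of(tree):
    """Return Cl(T) for a rooted tree given as nested tuples of leaf labels."""
    found = set()

    def taxa_below(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(taxa_below(child) for child in node))
        else:
            below = frozenset([node])
        found.add(below)
        return below

    taxa_below(tree)
    return found


def is_refinement(candidate, tree):
    """T' refines T if every cluster of T is also a cluster of T'."""
    return clusters_of(tree) <= clusters_of(candidate)


def is_binary(tree):
    """A tree is binary if every internal node has exactly two children."""
    if not isinstance(tree, tuple):
        return True
    return len(tree) == 2 and all(is_binary(child) for child in tree)


if __name__ == "__main__":
    star = ("a", "b", "c")        # non-binary tree on {a, b, c}
    resolved = (("a", "b"), "c")  # one of its binary refinements
    print(is_refinement(resolved, star), is_binary(resolved))
    print(is_refinement(star, resolved), is_binary(star))
```

The star tree is refined by the resolved tree but not the other way round, and every tree
refines itself, matching the note in definition 2.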

Two of the definitions, $h^\circ$ and $h^-$, turn out to be equivalent. We clarify this, extend an
equivalence result from [25] and thus show that Cass^DC correctly computes the hybridization
number of two not necessarily binary trees in the sense of $h^\circ$, equivalently $h^-$. In Figure 8 of [25]
a set $\mathcal{T}$ of two trees is given, one of which is non-binary, such that $r(Cl(\mathcal{T})) < h^+(\mathcal{T})$. A result
showing that Cass computes $h^+$ was thus already excluded.

Observation 9. $h^\circ(\mathcal{T}) = h^-(\mathcal{T})$ for all sets $\mathcal{T}$ of not necessarily binary trees on the same taxa
set $X$.

Proof. The fact that $h^\circ(\mathcal{T}) \le h^-(\mathcal{T})$ follows immediately from the definitions. Suppose by way of
contradiction that there exists a set of not necessarily binary trees $\mathcal{T}$ such that $h^-(\mathcal{T}) > h^\circ(\mathcal{T})$.
Let $N$ be any network with reticulation number $h^\circ(\mathcal{T})$ that displays some refinement of each tree
in $\mathcal{T}$. Now, it is not too difficult to see (using for example the transformation described in Lemma
2 of [35]) that we can create a binary network $N'$ such that $r(N') = r(N)$ and such that $N'$
displays a binary refinement of each of the trees in $\mathcal{T}$, yielding a contradiction. □

Given a set of trees $\mathcal{T}$, recall that we define $r_{tr}(\mathcal{T})$ to be the minimum reticulation number
of any network that displays $Tr(\mathcal{T})$ (i.e. all the rooted triplets in the input trees). We have the
following result:

Theorem 7. Let $\mathcal{T} = \{T_1, T_2\}$ be two not necessarily binary trees on $X$. Then $h^\circ(\mathcal{T}) = r_{tr}(\mathcal{T})$.

Proof. Obviously, for any set $\mathcal{T}$ of not necessarily binary trees on $X$, $r_{tr}(\mathcal{T}) \le h^\circ(\mathcal{T})$. It remains
to show that this inequality is always tight. Suppose that it is not always tight. Let then
$\mathcal{T} = \{T_1, T_2\}$ be two smallest (in terms of the size of $|X| = n$) trees such that
$h^\circ(\mathcal{T}) > r_{tr}(\mathcal{T})$. Clearly $n > 2$. Now, let $N_{tr}$ be any network that is consistent with
$Tr(\mathcal{T}) = R$ such that $r(N_{tr}) = r_{tr}(R)$. If $r_{tr}(R) = 0$ we have a contradiction because this only
occurs if $T_1$ and $T_2$ have a common refinement, in which case $h^\circ(\mathcal{T}) = 0$. Hence $r_{tr}(R) > 0$. This
means that $N_{tr}$ has at least one SBR with taxa set $S$. Now, we claim that, for each $i \in \{1, 2\}$, $S$ is
the set of taxa reachable from an edge $e_i$ (in $T_i$) or the set of taxa reachable from some subset of
the children of some node $u_i$ (in $T_i$). If this is not so then there exists some triplet $xy|z$ in $R$
such that $x \notin S$ and $y, z \in S$. However, $N_{tr}$ cannot be consistent with $xy|z$ since $S$ is the taxa set
of an SBR, which by definition sits below a cut-edge, yielding a contradiction. (If $e_i$ exists assume
without loss of generality that $u_i$ is its head.) Let $Q' = \{v_1, \ldots, v_k\}$ be the set of children of
$u_i$ such that for each $v \in Q'$, $X(T_v)$ contains at least one element of $S$. Now let $T'_i$ be the tree
$T_i \setminus T_{v_1} \setminus \ldots \setminus T_{v_k}$ and let $\mathcal{T}' = \{T'_1, T'_2\}$. It is clear that
$r_{tr}(\mathcal{T}') \le r_{tr}(R) - 1$. By the assumption of minimality on $|X|$ we have that
$h^\circ(\mathcal{T}') = r_{tr}(\mathcal{T}')$. Now, let $N^*$ be a network with a minimum reticulation number that displays
some refinement of each tree in $\mathcal{T}'$. Observe that there exists a binary tree $T_S$ on taxa set $S$ such
that $T_S$ is a binary refinement of both $T_1|S$ and $T_2|S$, otherwise $R|S$ would not be an SBR (i.e. it
would contain at least one reticulation).

But we can obtain a network with $r(N^*) + 1$ reticulations that displays some refinement of
$T_1$ and $T_2$ by hanging $T_S$ back below a single new reticulation in $N^*$ such that the tails of the
two reticulation edges extend the embeddings of the refinements of $T'_1$ and $T'_2$ in $N^*$. Hence
$h^\circ(\mathcal{T}) \le r_{tr}(\mathcal{T}') + 1 \le r_{tr}(\mathcal{T}) < h^\circ(\mathcal{T})$, contradiction. □

The following lemma extends a result from [35] : 

Lemma 12. If $\mathcal{T}$ consists of two not necessarily binary phylogenetic trees on the same set of taxa,
$r_{tr}(\mathcal{T}) = h^\circ(\mathcal{T}) = r(Cl(\mathcal{T}))$.

Proof. In [35] it is proven that for a not necessarily binary set of trees $\mathcal{T}$ on the same set of
taxa, $r_{tr}(\mathcal{T}) \le r(Cl(\mathcal{T}))$. Now, if a network displays some refinement of a not necessarily binary
tree $T$, then it represents all the clusters in $Cl(T)$ (and possibly more). Hence we also have that
$r(Cl(\mathcal{T})) \le h^\circ(\mathcal{T})$. Combining this with Theorem 7 gives the result. □

Combining all these results we finally obtain the following theorem. 

Theorem 8. Let $\mathcal{T} = \{T_1, T_2\}$ be two not necessarily binary trees on $X$, and let $\mathcal{C} = Cl(\mathcal{T})$.
When given $\mathcal{C}$ as input, Cass^DC computes a network $N$ such that $r(N) = h^\circ(\mathcal{T}) = h^-(\mathcal{T})$.

8 Conclusion and open problems 

The largest open problem emerging from this article is whether there exists a "reasonable" 
polynomial-time algorithm for constructing phylogenetic networks of bounded reticulation number 
(or level) that represent a given set of clusters. The result in Section 3, which shows a theoretical
polynomial-time algorithm for constructing networks of bounded level, does not lend itself to a
real-world implementation. On the other hand we have seen that for clusters obtained from binary
trees a relatively simple and efficient algorithm can be used. At the moment however there is 
no reasonable polynomial-time algorithm for general cluster sets i.e. those obtained from sets of 
potentially non-binary trees. We have shown that Cass, which has a reasonable running time, is 
in general not optimal (although we had to explicitly engineer a highly synthetic counter-example 
to determine this). Cass is, however, an extremely greedy algorithm, in the sense that at every 
iteration it assumes that all maximal ST-sets are below cut-edges. Can we relax this assumption 
in some way to yield a slightly less greedy version of Cass that is optimal? A less urgent problem, 
but nevertheless very interesting, is the question whether Cass is optimal for clusters obtained 
from exactly three binary trees on X . 

A Appendix 

A.1 Proofs deferred to the appendix

For a tree-node u with set of children Q, a Q'-refinement, where Q' ⊂ Q and |Q'| ≥ 2, is the
network obtained by deleting all edges between u and elements of Q', adding a new node u',
adding the edge (u, u'), and adding edges between u' and each element of Q'. We say that a
tree-node can be refined if there exists some Q'-refinement of it. Note that if a network N represents
a set of clusters 𝒞, then the network N' obtained by refining some tree-node of N still represents 𝒞.
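
The refinement operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the network is stored as a plain list of directed edges (parent, child), Q' ⊂ Q is read as a proper subset, and the function name `q_refinement` and the node labels are hypothetical.

```python
def q_refinement(edges, u, q_prime, new_node):
    """Return the edge list obtained by a Q'-refinement of tree-node u.

    edges    : list of directed edges (parent, child)
    q_prime  : the children of u that are moved below the new node
    new_node : label for the new node u' (assumed not to occur in edges)
    """
    children = {c for (p, c) in edges if p == u}
    q_prime = set(q_prime)
    # Q' must be a proper subset of the children of u with at least two elements.
    if not (q_prime < children and len(q_prime) >= 2):
        raise ValueError("Q' must be a proper subset of the children with |Q'| >= 2")
    # Delete the edges between u and the elements of Q' ...
    kept = [(p, c) for (p, c) in edges if not (p == u and c in q_prime)]
    # ... then add the edge (u, u') and the edges from u' to each element of Q'.
    return kept + [(u, new_node)] + [(new_node, q) for q in sorted(q_prime)]


edges = [("root", "u"), ("u", "a"), ("u", "b"), ("u", "c")]
refined = q_refinement(edges, "u", {"a", "b"}, "u'")
print(refined)
```

In this toy example the node u keeps the child c and gains the child u', which in turn has the children a and b.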

Observation 1. Let 𝒞 be a separating set of clusters on X. Let N be any network that
represents 𝒞. Then each node of N has at most one leaf child and for each cut-edge (u, v) in N,
|X(v)| = 1 or X(v) = X.



Proof. Suppose N contains a cut-edge (u, v) such that 1 < |X(v)| < |X|. Let C be any cluster
represented by N. Then either C ∩ X(v) = ∅, X(v) ⊆ C or C ⊆ X(v), because (u, v) is a cut-edge.
Hence X(v) is unseparated, contradiction. Now, suppose some node of N has two leaf-children
x ≠ y ∈ X. Then every non-singleton cluster that contains x also contains y, and vice versa,
meaning that {x, y} is an unseparated set, contradiction. □

Lemma 1. Let 𝒞 be a separating set of clusters on X. Let N be any network that represents 𝒞.
Then there exists a simple network N* with at most one leaf-child per node such that ℓ(N*) ≤ ℓ(N).

Proof. Clearly Observation 1 holds for N. We will transform N to obtain N*. We repeatedly and
arbitrarily apply any of the following four operations until none of them can be applied. (a) If N
contains a node u that has indegree and outdegree both larger than 1 then create a new node u',
add the edge (u, u') and move the tails of all edges that leave u to u'. (b) If N contains a cut-edge
(u, v) such that X(v) = X, then keep all nodes and edges reachable from v by directed paths, take
v as a new root, and discard the rest of the network. (c) If N contains a cut-edge (u, v) such that
X(v) = {x} where x ∈ X, but v is not labelled by x, then delete all nodes and edges reachable
from v by directed paths (but not v), and label v with taxon x. (d) If some tree-node of N can be
refined to create a cut-edge (u, u') such that u' is not a leaf, then do that.

Note that after applying each operation the resulting network still represents 𝒞. Furthermore,
and less obviously, the operations do not raise the level of the network. This is clear for the
"deletion" operations (b) and (c) but for operations (a) and (d) the critical observation is that two
biconnected components of N overlap on at most one node, and in particular are edge disjoint. Let
N* be the network obtained after no more operations (a)-(d) can be applied. Clearly, N* represents
𝒞, and Observation 1 applies to it. If N* is simple we are done. If N* is not simple then it
must contain a cut-node u which (when removed) disconnects N* into two or more non-trivial
components, because N* does not contain any cut-edges with this property by Observation 1. Let
P = {p_1, ..., p_i} (i ≥ 0) be the parent nodes of u and let Q = {q_1, ..., q_j} (j ≥ 1) be the children
of u. For a node v ≠ u let R_u(v) be the set of nodes connected to v by an undirected path, after
deletion of node u. We know that deleting u splits N* into at least two non-trivial components i.e.
components that contain at least one edge. Observe that each such component will be the union
of one or more of the (i + j) R_u(v) sets obtained by taking v ∈ P ∪ Q. Secondly, it is useful to
note that R_u(p_1) = R_u(p_2) = ... = R_u(p_i). This follows because in a phylogenetic network every
node can be reached from the root by some directed path, and hence the parents of u can still
reach each other with undirected paths (via the root if necessary) after deletion of u. So at least
one of the non-trivial components created by deletion of u is equal to ⋃_{v ∈ Q'} R_u(v) where Q' ⊆ Q.
Consider the case that Q' = Q. Then we cannot have that i = 0, because this would mean that u
is not a cut-node. So i ≥ 1. If i = 1 then we conclude that (p_1, u) is a cut-edge in N* and thus that
u is a leaf, contradicting the assumption that j ≥ 1. If i ≥ 2 then j = 1 (because all reticulations
in N* have outdegree 1) and hence we can conclude that (u, q_1) was a cut-edge in N*. This means
that in N*, q_1 is a leaf labelled by a single taxon and hence that R_u(q_1) is a trivial component.
But this contradicts the fact that at least two non-trivial components are created by deletion of
u.

Consider finally the case that Q' ⊂ Q. Suppose |Q'| = 1 and let (wlog) q_1 be the only element
of Q'. Then the edge (u, q_1) is actually a cut-edge in N*, meaning that q_1 is a leaf labelled by a
single taxon and (again) that R_u(q_1) is a trivial component, contradiction. Hence |Q'| ≥ 2 and
j ≥ 3. Note that we can construct a new network N** which still represents 𝒞 by performing a
Q'-refinement on u. Furthermore, the edge (u, u') created by the refinement must be a cut-edge.
But then we could have applied a type-(d) operation to N*, contradicting the assumption that no
more applications of operations (a)-(d) were possible.

Hence we conclude that N* is a simple network with at most one leaf-child per node that
represents 𝒞 and, because the four operations (a)-(d) do not increase level, ℓ(N*) ≤ ℓ(N). □

Lemma 2. Let N be a phylogenetic network on X. Then we can transform N into a binary
phylogenetic network N' such that N' has the same reticulation number and level as N and all 
clusters represented by N are also represented by N' . 



Proof. The following transformation is similar to one described in Lemma 2 of [25]. (a) For each
reticulation node u with outdegree 2 or higher, introduce a new node u', add an edge (u, u') and
move the tails of edges that leave u to u'. This ensures that all reticulation nodes have outdegree
1. (b) For each reticulation node u with indegree d (d ≥ 3), we can replace u with a chain of (d − 1)
reticulation nodes of indegree 2. For example, if u has 3 parents p_1, p_2, p_3 and child q_1 we delete
u, add 2 new nodes u_1, u_2 and new edges (p_1, u_1), (p_2, u_1), (u_1, u_2), (p_3, u_2), (u_2, q_1). To see that
steps (a) and (b) do not increase the level of the network recall that biconnected components
intersect on at most one node, and (as observed in the proof of Lemma 1) a reticulation node and
all its parents are always in the same biconnected component.
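
Step (b) can be illustrated with a small sketch. It assumes the network is given as a list of directed edges, that step (a) has already been applied (so the reticulation has exactly one child), and that the reticulation number is counted as the sum, over all nodes of indegree at least 2, of indegree minus one. The names `split_reticulation` and `reticulation_number` and the labels of the new nodes are hypothetical.

```python
from collections import Counter


def reticulation_number(edges):
    """Sum of (indegree - 1) over all nodes with indegree at least 2."""
    indegree = Counter(child for (_, child) in edges)
    return sum(d - 1 for d in indegree.values() if d >= 2)


def split_reticulation(edges, u):
    """Replace reticulation u of indegree d >= 3 by a chain of d - 1
    reticulations of indegree 2 (assumes u has exactly one child)."""
    parents = [p for (p, c) in edges if c == u]
    children = [c for (p, c) in edges if p == u]
    if len(parents) < 3 or len(children) != 1:
        raise ValueError("u must have indegree >= 3 and outdegree 1")
    chain = [f"{u}_{k}" for k in range(1, len(parents))]  # the d - 1 new nodes
    new_edges = [e for e in edges if u not in e]          # delete u and its edges
    new_edges += [(parents[0], chain[0]), (parents[1], chain[0])]
    for k in range(1, len(chain)):
        # Each further chain node has the previous chain node and one
        # remaining parent of u as its two parents.
        new_edges += [(chain[k - 1], chain[k]), (parents[k + 1], chain[k])]
    new_edges.append((chain[-1], children[0]))
    return new_edges


edges = [("p1", "u"), ("p2", "u"), ("p3", "u"), ("u", "q1")]
binary = split_reticulation(edges, "u")
print(binary)
print(reticulation_number(edges), reticulation_number(binary))
```

For three parents the sketch reproduces the five edges listed in the example above, and the reticulation number is the same before and after the replacement.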

(c) The only nodes we still need to consider are tree-nodes. Let u be a tree-node with outdegree
d (d ≥ 3). If at least one child of u is a leaf, but not all children of u are leaves, let q_1 be a leaf-child
of u and q_2 be a non-leaf child of u. In this case we delete the edges (u, q_1), (u, q_2), add a new
node u', and add new edges (u, u'), (u', q_1), (u', q_2). We repeat this process until all tree-nodes
u with outdegree 3 or higher are such that either all their children are leaves, or none of their
children are leaves. For each tree-node u with outdegree 3 or higher whose children are all leaves,
let X' = X(u), delete all children Q of u and identify u with the root of an arbitrary binary tree
on taxa set X'. Now, all remaining tree-nodes with outdegree 3 or higher will be such that none
of their children are leaves. The main subtlety here is to avoid a situation where, by refining a
tree-node in the wrong way, we merge two biconnected components and thus raise the level of the
network. To avoid this we apply the following operation as often as possible: (d) Let u be any
tree-node with outdegree 3 or higher such that there exists a biconnected component K which
contains u and at least two, but not all, of the edges leaving u. Let Q' be the children of u that
K contains. Perform a Q'-refinement on u.

After no more applications of operation (d) are possible there is no danger of merging
biconnected components by refining tree-nodes so we can apply the last refinement step: (e) we replace
each remaining tree-node u with outdegree d (d ≥ 3) by a chain of (d − 1) tree-nodes with
outdegree 2. For example, if u has 3 children q_1, q_2, q_3 then we delete u, create two new nodes u_1, u_2,
add edges (u_1, q_1), (u_1, u_2), (u_2, q_2), (u_2, q_3) and (if u had a parent) add an edge from the former
parent of u to u_1. This completes the transformation. □

Lemma 4. Let 𝒞 be a set of clusters on X and let S_1 ≠ S_2 be two ST-sets of 𝒞. If S_1 ∩ S_2 ≠ ∅
then S_1 ∪ S_2 is an ST-set.

Proof. If S_1 ⊆ S_2 or S_2 ⊆ S_1 then we are done, so we assume neither such condition holds. Now,
suppose that some cluster C ∈ 𝒞 is incompatible with S_1 ∪ S_2. We will derive a contradiction.
Note that C is compatible with S_1 and S_2 since both S_1 and S_2 are ST-sets. Then we cannot have
that C ∩ S_1 = ∅ or C ⊆ S_1 because then C would be compatible with S_1 ∪ S_2. It follows that
S_1 ⊆ C. With the same reasoning we obtain that S_2 ⊆ C. So S_1 ∪ S_2 ⊆ C. Hence C is compatible
with S_1 ∪ S_2, a contradiction. It remains to show that any pair of clusters in 𝒞|(S_1 ∪ S_2) are
compatible. Consider any cluster C in 𝒞|(S_1 ∪ S_2). Obviously, C ⊆ S_1 ∪ S_2 and there are three
possibilities: (i) C = S_1 ∪ S_2, (ii) C ⊆ S_1 ∩ S_2, or (iii) C ∩ (S_1 ∩ S_2) = ∅. To see this, suppose
there exists a cluster C ∈ 𝒞|(S_1 ∪ S_2) that violates all three possibilities. Then, without loss of
generality, there is by violation of (ii) some x ∈ S_1 such that x ∉ S_2 and x ∈ C. There is also, by
violation of (iii), some x' ∈ S_1 ∩ S_2 such that x' ∈ C. This means that S_1 ∩ C ≠ ∅ and
S_2 ∩ C ≠ ∅. Let C' be a cluster in 𝒞 such that C'|(S_1 ∪ S_2) = C. It cannot be that C' ⊆ S_2
(because of x) so S_2 ⊆ C' (because S_2 is an ST-set in 𝒞 and S_2 ∩ C' ≠ ∅). Now, because (i) was
violated it follows that there is some x'' ∈ S_1 such that x'' ∉ C and thus x'' ∉ C', so S_1 ⊈ C'.
Now, since S_2 ⊆ C' and we assumed that S_2 ⊈ S_1, we have that C' ⊈ S_1. But C' ∩ S_1 ≠ ∅, so C'
is incompatible with S_1 and S_1 is not an ST-set, a contradiction. Hence (i)-(iii) are the only
possibilities.

Now, suppose there are two clusters C_1, C_2 ∈ 𝒞|(S_1 ∪ S_2) which are incompatible. Let C'_1 and
C'_2 be the clusters in 𝒞 such that C'_1|(S_1 ∪ S_2) = C_1 and C'_2|(S_1 ∪ S_2) = C_2. Now, clearly type
(i) clusters cannot cause incompatibilities. Furthermore, type (ii) and (iii) clusters are disjoint, so
C_1, C_2 are both type (ii) clusters or are both type (iii) clusters. Suppose they are both type (ii)
clusters i.e. both are entirely contained inside S_1 ∩ S_2. If C'_1 = C_1 and C'_2 = C_2 then we obtain a
contradiction, since S_1 ∩ S_2 ⊆ S_1. Specifically, we have that C'_1, C'_2 ⊆ S_1 and thus S_1 cannot be
an ST-set. Hence, without loss of generality, we assume that C'_1 contains an element x ∉ S_1 ∪ S_2.
But then C'_1 ⊈ S_1, S_1 ⊈ C'_1 (because S_1 ∩ S_2 ⊂ S_1) and S_1 ∩ C'_1 ≠ ∅. Again, we conclude that
S_1 is not an ST-set. The final case is that both C_1 and C_2 are type (iii) clusters. Clearly either
C_1, C_2 ⊆ S_1 \ S_2 or C_1, C_2 ⊆ S_2 \ S_1, otherwise C_1 and C_2 would be compatible. Assume without
loss of generality that C_1, C_2 ⊆ S_1 \ S_2. Since S_1 \ S_2 ⊂ S_1, if C'_1 = C_1 and C'_2 = C_2 then S_1
cannot be an ST-set, a contradiction. So assume C'_1 contains an element x ∉ S_1 ∪ S_2. So C'_1 ⊈ S_1.
Furthermore, S_1 ⊈ C'_1 because S_1 ∩ S_2 ≠ ∅. But S_1 ∩ C'_1 ≠ ∅, so S_1 is not an ST-set, again a
contradiction. □

Lemma 5. The maximal ST-sets of a set of clusters 𝒞 on X can be computed in polynomial time.

Proof. Start with a set 𝒮 of n singleton ST-sets. If there are two distinct ST-sets S_1, S_2 ∈ 𝒮 such
that S_1 ∪ S_2 is an ST-set, then remove S_1 and S_2 from 𝒮 and add S_1 ∪ S_2 to 𝒮. Repeat this until
it is no longer possible. This results in a set 𝒮 of ST-sets that partitions X. Let ℳ be the set of
maximal ST-sets of 𝒞. We will show that 𝒮 = ℳ.

Observe that for each S ∈ 𝒮 and each M ∈ ℳ, either M ∩ S = ∅ or S ⊆ M. Indeed, from
Lemma 4, if some S and M were incompatible then S ∪ M would be an ST-set too, contradicting
the maximality of M. So each M ∈ ℳ is partitioned by one or more elements from 𝒮. Now, suppose
some S ∈ 𝒮 is not maximal. Then there exists some M ∈ ℳ such that S ⊂ M. Furthermore,
there must also exist some S' ≠ S such that S' ⊂ M i.e. M is partitioned by at least two elements
from 𝒮. Hence we can write M = S_1 ∪ ... ∪ S_k where k ≥ 2 and each S_i ∈ 𝒮. We denote by 𝒮' the
set of ST-sets of 𝒮 partitioning M. Now, consider the set of clusters 𝒞|M. All clusters in 𝒞|M are
mutually compatible, because M is an ST-set. Furthermore, every cluster in 𝒞|M is compatible
with every S_i contained inside M because the S_i are ST-sets and 𝒮' partitions M.

Now, for a cluster C ∈ 𝒞|M, let nb(C) be the number of S_i ∈ 𝒮' that it contains. Let b =
min{nb(C) | C ∈ 𝒞|M and nb(C) ≥ 2}. Suppose b is not defined. Let S_i ≠ S_j be any two ST-sets
of 𝒮'. We claim that S_i ∪ S_j is an ST-set. First, we need to prove that S_i ∪ S_j is not separated by
𝒞. Let us suppose this is not true and there exists a cluster C ∈ 𝒞 such that C ∩ (S_i ∪ S_j) ≠ ∅,
C ⊈ S_i ∪ S_j and S_i ∪ S_j ⊈ C. It follows that C ⊈ S_i and C ⊈ S_j. Moreover, since S_i ∪ S_j ⊈ C,
C does not contain at least one among S_i and S_j, say S_i. Then C ∩ S_i = ∅, otherwise S_i is not
an ST-set. Then we have that C ∩ S_j ≠ ∅. If S_j ⊈ C we have again a contradiction. So we must
have S_j ⊂ C and C ∩ S_i = ∅. But since C cannot contain more than one ST-set of 𝒮', C separates
M, a contradiction. It follows that S_i ∪ S_j is not separated by 𝒞. Moreover, since no cluster of
𝒞|M contains more than one ST-set of 𝒮', each cluster in 𝒞|M is entirely contained inside some
single ST-set of 𝒮'. It follows that 𝒞|(S_i ∪ S_j) is a compatible set of clusters because in this case
𝒞|(S_i ∪ S_j) = 𝒞|S_i ∪ 𝒞|S_j and S_i ∩ S_j = ∅. Then S_i ∪ S_j is an ST-set, contradicting the termination
condition for the algorithm.

Now, suppose that b is well defined. Let C be any cluster in 𝒞|M such that nb(C) = b. Let
S_i ≠ S_j be any two ST-sets in 𝒮' that are contained in C. We claim that S_i ∪ S_j is an ST-set.
We first show that S_i ∪ S_j is unseparated. Suppose this is not true. Then there exists C' ∈ 𝒞 such
that C' ∩ (S_i ∪ S_j) ≠ ∅, C' ⊈ (S_i ∪ S_j) and (S_i ∪ S_j) ⊈ C'. With the same argument as before
we have (without loss of generality) that S_j ⊂ C' and C' ∩ S_i = ∅. This means that M ⊈ C'.
Combined with the fact that C' ∩ M ≠ ∅ and that M is a maximal ST-set of 𝒞, we have that
C' ⊂ M and thus C' ∈ 𝒞|M. C' and C are compatible (because all clusters in 𝒞|M are mutually
compatible), so combining that C ∩ C' ≠ ∅ and that C ⊈ C', we have that C' ⊂ C. Observe that
C' is partitioned by ST-sets from 𝒮'. Now, we cannot have that nb(C') ≤ 1 because S_j is a strict
subset of C'. So 2 ≤ nb(C') < b, contradiction. It remains only to prove that all the clusters in
𝒞|(S_i ∪ S_j) are mutually compatible. S_i ∪ S_j is unseparated by 𝒞 so for each C ∈ 𝒞 we have that
C ∩ (S_i ∪ S_j) = ∅, S_i ∪ S_j ⊆ C or C ⊆ S_i ∪ S_j. For each C ∈ 𝒞 such that C ⊆ S_i ∪ S_j we have
either that C = S_i ∪ S_j, C ⊆ S_i or C ⊆ S_j (because C is compatible with both S_i and S_j). Since
S_i ∩ S_j = ∅, it is not possible for 𝒞|(S_i ∪ S_j) to contain two incompatible clusters, and again we
conclude that S_i ∪ S_j is an ST-set.

□ 
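
The merging procedure from the proof of Lemma 5 can be sketched as follows. This is a minimal illustration, not an optimized implementation. It assumes the definitions used in the article: two clusters are compatible if they are disjoint or one contains the other, and S is an ST-set if no cluster is incompatible with S and the clusters restricted to S are pairwise compatible. The function names and the toy cluster set are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Two sets are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a


def is_st_set(s, clusters):
    """S is an ST-set if it is unseparated and the restriction of the
    clusters to S is pairwise compatible."""
    if any(not compatible(c, s) for c in clusters):
        return False
    restricted = {c & s for c in clusters} - {frozenset()}
    return all(compatible(a, b) for a, b in combinations(restricted, 2))


def maximal_st_sets(taxa, clusters):
    """Start from singletons and merge two parts whenever their union is
    an ST-set, until no merge is possible."""
    clusters = [frozenset(c) for c in clusters]
    parts = [frozenset([x]) for x in taxa]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(parts, 2):
            if is_st_set(a | b, clusters):
                parts.remove(a)
                parts.remove(b)
                parts.append(a | b)
                merged = True
                break  # restart the scan on the updated partition
    return set(parts)


# Toy input (hypothetical, not the cluster set of the article).
result = maximal_st_sets(range(1, 6), [{1, 2}, {3, 4}, {1, 2, 3}, {3, 4, 5}])
print(result)
```

At most n − 1 merges can occur and each scan tests a quadratic number of pairs, so the number of ST-set tests is polynomial in n, in line with the lemma.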



Lemma. Let N be a network that represents a set of clusters 𝒞. Let S be a non-trivial ST-set
with respect to 𝒞. Then there exists a network N' such that r(N') ≤ r(N), ℓ(N') ≤ ℓ(N), S is
under a cut-edge in N' and for each ST-set S' such that S' ∩ S = ∅ and S' is under a cut-edge in
N, S' is also under a cut-edge in N'.

Proof. We obtain N' from N by the following transformation. Let x be any element of S. Recall
that leaves in N always have indegree 1 and outdegree 0, so let (u, v) be the edge in N such that
v is labelled by x. This is trivially a cut-edge. (a) Delete in N all taxa in S (but not the leaves
they label; we will deal with this in step (c)). (b) Identify v with the root of a tree T_S on S
that represents 𝒞|S. (c) Tidy up redundant parts of the network possibly created in step (a) by
applying in an arbitrary order any of the following steps until no more can be applied: deleting any
nodes with outdegree 0 that are not labelled by a taxon; suppressing any nodes with indegree 1
and outdegree 1; replacing any multi-edges with a single edge; deleting any node with indegree 0
and outdegree 1. This concludes the transformation. Let N' be the resulting network. To see that
N' represents 𝒞 consider any C ∈ 𝒞. If C ⊆ S then clearly some edge in the tree we added in step
(b) represents C. Otherwise there are only two possibilities (because S is unseparated): S ∩ C = ∅
or S ⊆ C. In either case we know that there exists some tree T on X which is displayed by N and
such that T represents C. By applying steps (a)-(c) simultaneously to T we obtain a new tree T'
that is displayed by N' and which represents C. The critical reason this works is that, if C ⊈ S,
x ∈ C (where x is the element we selected at the beginning of the proof) if and only if S ⊆ C.
The tidying-up operations in step (c) clearly do not raise the reticulation number or the level of
the network. Furthermore they leave untouched any taxa not in S. So any other ST-sets that were
already under a cut-edge in N are also under a cut-edge in N'. □

A.2 Proof that Cass does not construct a simple level-≤3 network for the input 𝒞
described in Figure 8

Let us begin with the following observation. 

Observation 10. If a phylogenetic network N represents 𝒞, then r(N) ≥ 3.

Proof. First, note that r(N) ≥ 2. This follows because the clusters {8, 5}, {1, 5}, {5, 6} are all in
𝒞. A network with reticulation number r can display at most 2^r mutually non-isomorphic trees,
and each tree displayed by the solution N can represent at most one of those clusters. Now,
suppose some network N exists that represents 𝒞 such that r(N) = 2. All SBRs of N are single
leaves, because otherwise 𝒞 would not be separating. Now, observe that at least one of the leaves
{1, 5, 6, 8} must be an SBR. Suppose this is not so, and let taxon i ∉ {1, 5, 6, 8} be a single leaf
SBR. But then removing leaf i from N and tidying up the resulting network would give us a
network N' with r(N') ≤ 1 such that N' displays the three clusters described at the beginning of
the proof; contradiction. So suppose 1 is the SBR. If we remove taxon 1 then the remaining network
N' has r(N') ≤ 1 and represents {8, 5}, {5, 6, 7} and {3, 4, 5}, contradiction. If taxon 5 is the SBR
then the same argument holds for {1, 2}, {3, 4, 1} and {7, 6, 1}. (In fact, taxon 5 is symmetrical to
taxon 1.) If taxon 6 is the SBR then clusters {5, 8}, {5, 7}, {1, 3, 4, 5} cause the contradiction. If
taxon 8 is the SBR then clusters {1, 3, 4}, {1, 2}, {1, 5} cause the contradiction. □

To prove the overall result we thus exhaustively consider all maximal ST-set tree sequences of
length exactly 3. These sequences can either be derived by hand, or generated computationally.
Many maximal ST-set sequences of length 3 do not remove all incompatibilities in the clusters i.e.
they are not tree sequences. We can thus automatically exclude these sequences from our analysis
(because Cass will immediately conclude that it is not possible to construct a simple level-≤3
network that represents 𝒞 from such a sequence). Additionally, many of the tree sequences can be
excluded by using an argument similar to that used in Observation 10. In particular, if two maximal
ST-sets have already been removed, but the input still contains three mutually incompatible
clusters, then we discard this tree sequence. This is because any network that represents the
remaining clusters (i.e. those obtained after removing the first two maximal ST-sets) must have
reticulation number at least 2, and subsequently hanging back the first two maximal ST-sets
below reticulations raises the reticulation number of the network to at least 4. In other words,
all attempts of Cass to construct a simple level-≤3 network representing 𝒞 using this maximal
ST-set tree sequence are doomed to fail.
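
The second filtering rule rests on the counting argument of Observation 10: k pairwise incompatible clusters need k distinct displayed trees, and a network with reticulation number r displays at most 2^r trees, so r is at least the base-2 logarithm of k, rounded up. The following sketch shows how such a test might look. The function names are hypothetical and this is not the automated case analysis used by the authors.

```python
from itertools import combinations


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a


def has_incompatible_triple(clusters):
    """True if some three clusters are pairwise incompatible."""
    cs = [frozenset(c) for c in clusters]
    return any(
        not compatible(a, b) and not compatible(a, c) and not compatible(b, c)
        for a, b, c in combinations(cs, 3)
    )


def min_reticulations(k):
    """Smallest r with 2**r >= k, a lower bound on the reticulation number
    when k pairwise incompatible clusters have to be represented."""
    return (k - 1).bit_length()


def discard_after_two_removals(remaining_clusters, budget=3):
    """Filtering rule: two maximal ST-sets were already removed (costing at
    least one reticulation each); discard the sequence if the remaining
    clusters alone push the total above the budget."""
    if has_incompatible_triple(remaining_clusters):
        return 2 + min_reticulations(3) > budget
    return False


print(has_incompatible_triple([{8, 5}, {1, 5}, {5, 6}]))
print(min_reticulations(3))
```

The three clusters used in the example are the ones named at the start of the proof of Observation 10.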

It turns out that after applying these two filtering rules there only remains a very small
number of cases to consider. These are:

1. ({1},{5},∅)
2. ({8},{1},{5})
3. ({1},{8},{5})
4. ({5},{1},∅)
5. ({2},{5},{1})
6. ({5},{2},{1})

Note that, because of symmetries in the cluster set, sequence 4 is entirely symmetrical to
sequence 1, sequence 5 is entirely symmetrical to sequence 2 and sequence 6 is entirely symmetrical
to sequence 3. Hence we can actually restrict our attention to only sequences 1-3.




Fig. 13. The "base tree" T in the case ({1},{5},∅), on leaves 7, 6, 8, 2 and 34.



Case ({1},{5},∅)

We can see already that this is a potentially important tree sequence because it is closely linked to 
the network shown in Figure 9(a). Specifically: remove taxon 1, causing the reticulation number to
drop by 1, and then remove taxon 5 which causes the reticulation number to drop by 2 (because 
one of the two remaining reticulations is a child of the other). Here we show why Cass nevertheless 
cannot reconstruct the network. 

Before removing taxon 1 the maximal ST-sets are {i}, 1 ≤ i ≤ 8. After removing taxon 1
the maximal ST-sets are {2}, {3,4}, {5}, {6}, {7}, {8}. {3,4} is a (new) non-singleton maximal
ST-set so the Cass algorithm collapses taxa 3 and 4 into a new meta-taxon, let us call this
34 for simplicity. After collapsing the maximal ST-sets are thus {2}, {34}, {5}, {6}, {7}, {8}.
Subsequently taxon 5 is removed and there are no more incompatible clusters in the input. We
construct the unique tree T that represents exactly the remaining clusters (and add a dummy
root): see Figure 13. For each taxon i in T let e_i be the edge in T that feeds into it.
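
The collapsing step described above can be sketched on the level of clusters. This is an assumption-laden illustration, not the Cass implementation: it assumes the collapsed set S is unseparated, it replaces S by a single meta-taxon in every cluster that contains S, and it simply sets aside the clusters that lie strictly inside S. The function name `collapse` and the toy clusters are hypothetical.

```python
def collapse(clusters, st_set, meta):
    """Replace the unseparated set st_set by the single meta-taxon `meta`."""
    st_set = frozenset(st_set)
    collapsed = set()
    for c in map(frozenset, clusters):
        if st_set <= c:
            # The cluster contains all of S: S becomes the meta-taxon.
            collapsed.add((c - st_set) | {meta})
        elif not (c & st_set):
            # The cluster is disjoint from S: unchanged.
            collapsed.add(c)
        elif not c <= st_set:
            raise ValueError("st_set is separated by cluster %s" % sorted(c))
        # Clusters strictly inside S are set aside here; they would be
        # restored when the meta-taxon is decollapsed.
    return collapsed


# Toy input (hypothetical), taxa written as strings.
toy = [{"3", "4", "5"}, {"3", "4"}, {"5", "6"}, {"3"}]
print(collapse(toy, {"3", "4"}, "34"))
```

In the toy input the cluster {3, 4, 5} becomes {34, 5} after collapsing, which mirrors the cluster {5, 34} that appears in the case analysis.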

Note how meta-taxon 34 is still collapsed. The constructive phase of Cass will now proceed as
follows: (a) try and hang back a dummy taxon (i.e. the empty set ∅) from T below a reticulation;
(b) try and hang taxon 5 back below a second reticulation; (c) decollapse meta-taxon 34; (d) try
and hang taxon 1 back below a third reticulation. Now, after decollapsing meta-taxon 34 we have
a network N' on taxa set {2, 3, 4, 5, 6, 7, 8} with a cherry on taxa {3, 4}. We argue that in steps (a)
and (b) certain edges of T will definitely have been subdivided by the tails of reticulation edges.
Note that e_6 has definitely been subdivided to make cluster {5, 6} possible. Similarly edge e_8 will
have been subdivided to make cluster {5, 8} possible. And edge e_34 will have been subdivided to
make cluster {5, 34} possible. Steps (a) and (b) can subdivide at most four edges of T, but observe
that if four edges are subdivided then one of the two reticulations in N' will not have been "used"
i.e. there will only hang a dummy taxon beneath it. Hence N' cannot simultaneously represent
{5, 6}, {5, 8} and {5, 3, 4}, because a network with reticulation number at least 2 is required to
resolve these incompatibilities. Steps (c) and (d) cannot remedy this so we can exclude the case
that four edges of T have been subdivided. Hence we can safely conclude that edge e_2 of T has
not yet been subdivided in N'.

Footnote: The automated case-analysis can be downloaded from
http://skelk.sdf-eu.org/clustistic/caseanalysis.txt

Consider step (d). We have to hang back taxon 1 beneath a single reticulation i.e. a reticulation
with two incoming edges. Observe that at least one of those two reticulation edges subdivides (i.e.
starts from) the edge entering taxon 3, otherwise the cluster {1, 3} ∈ 𝒞 cannot be represented.
Now, note also that {1, 2} ∈ 𝒞. To obtain this cluster the edge e_2 has to be subdivided, because
otherwise every cluster that contains both 1 and 2 will also contain 3 and 4. But if both the two
reticulation edges are used in this way, then any cluster that contains a taxon from {6, 7, 8} and
contains taxon 1, must also contain {2, 3, 4}. This is not satisfactory however, because cluster
{1, 5, 6, 7} is in 𝒞. In other words: at least three reticulation edges are needed to hang back taxon
1, yielding a contradiction.

It will be clear from this example that the problem with Cass is that it collapses maximal 
ST-set {3,4}, forcing the iteration in which taxon 1 is hung back to use at least 3 reticulation 
edges instead of 2. 




Fig. 14. The "base tree" T for the cases ({8},{1},{5}) and ({1},{8},{5}).



Case ({8},{1},{5}) 

Let T be the tree obtained after removing these three maximal ST-sets. After removing the
maximal ST-set {1} taxa 3 and 4 will, as in the previous case, have been collapsed into the
meta-taxon 34. Hence T will look as in Figure 14. As in the previous case let e_i be the edge feeding
into taxon i. Observe that to hang back maximal ST-set {5} it is essential to subdivide e_6 and
e_34; this is the only way to ensure that the clusters {5, 6} and {5, 34} are represented. Let N' be
the network obtained by doing this and subsequently expanding meta-taxon 34 into a cherry on
taxa {3, 4}. Now, we need to hang maximal ST-set {1} back from N' increasing the reticulation
number exactly by one. Note that in any case the edge entering taxon 3 in N' will have to be
subdivided (to represent the cluster {1, 3}), and also the edge entering taxon 2 will need to be
subdivided (to represent the cluster {1, 2}). However, the cluster {1, 5, 6, 7} will then definitely not
be represented. Hence we conclude that actually at least three reticulation edges are required to
hang back {1}, and thus that it is impossible to hang back {1} increasing the reticulation number
exactly by one.

Case ({1},{8},{5}) 

The case has the same base tree T as the case ({8},{1},{5}), and again in this case there is
only one way to hang back maximal ST-set {5} as an SBR, yielding the same network N'. Now,
we first need to hang back maximal ST-set {8} in such a way that the reticulation number rises by
exactly 1. Let l be the child of the real root of N' which lies on a directed path from the real root
to taxon 6. We say that an edge of N' is a left edge if there is a directed path from the real root
that contains both the tail and head of the edge and which also passes through l. Observe that
when hanging back {8} at least one left edge of N' has to be subdivided by a reticulation edge,
otherwise the cluster {5, 6, 7, 8} will not be represented by the resulting network N''. Suppose the
second reticulation edge (when hanging back {8}) subdivided neither the edge entering taxon 2,
nor the edge entering taxon 3, in N'. But then, when subsequently hanging back {1} from N'',
both these edges will have to be subdivided (to ensure that the clusters {1, 2} and {1, 3} are
represented), and such a network cannot possibly represent the cluster {1, 5, 6, 7}. Suppose then that,
when hanging back {8}, the edge entering taxon 2 in N' was subdivided by a reticulation edge e.
But when hanging back {1} the edge entering taxon 3 in N'' will definitely have to be subdivided,
and also the edge entering taxon 2 or (alternatively) e. But, again, the cluster {1, 5, 6, 7} is not
represented by such a network. An essentially identical argument applies if, when hanging back
{8}, only the edge entering taxon 3 was subdivided. Hence hanging back {8} and then {1} causes
the reticulation number to rise by at least 3, rendering it impossible to construct a simple level-≤3
network.